BUILDING A PREDICTIVE MODEL FOR DETECTING DIABETES MELLITUS IN PATIENTS

ABSTRACT

## This term project paper studies a serious disease -- Diabetes Mellitus (Type II) -- widespread in the world. Diabetes can lead to a variety of health complications such as permanent blindness caused by glaucoma of the eyes. Diabetes is also a hereditary condition in many cases and depends upon a variety of physical factors of the human body. This research project explores some of the physical factors of the human body to build a predictive model for detecting diabetes mellitus in patients based upon real medical records obtained from a small population in the United States. It will help researchers to develop better predictive models for detecting diabetes mellitus in patients in future.

INTRODUCTION

## Diabetes Mellitus is a serious disease in which the body’s ability to produce or respond to the hormone insulin is impaired, resulting in abnormal metabolism of carbohydrates and elevated levels of glucose in the blood and urine [1]. It begins with insulin resistance, which is a condition in which the body's cells fail to respond to the hormone insulin in a proper manner. As this disease progresses, the patient may also develop a condition in which their body no longer produces insulin. Though hereditary factors may play a role in the onset of diabetes, the most common causes are excess of body weight, typically body mass index (BMI) and lack of proper exercise [2].
## 
## It is estimated that over 400 million people worldwide have diabetes mellitus with 90% of cases being diagnosed as type-II [3, 4, 5]. One thing about diabetes mellitus is that it is not gender specific as the rate of diabetes is nearly the same for men as that for women [6]. The global economic cost for diabetes is above USD 600 Billion with the cost being above USD 200 Billion in the USA alone [7, 8]. This research project examines some physical body parameters obtained from patients in the United States to build a predictive model for detecting the onset of diabetes in a certain population of people based in the United States using supervised and unsupervised machine learning techniques such as classification, regression, clustering and neural network analysis [9].

PROBLEM DEFINITION AND DATA

## This is an exploratory data analysis and machine learning problem on a dataset obtained from data.world [10]. The original dataset is available in the UCI Machine Learning Repository [11]. Exploratory analysis should be performed to engage in a preliminary analysis of the available data. Preliminary tests help discover relationships amongst the variables of the dataset. These relationships help us decide the important features of the dataset. Each feature is basically a variable or an attribute of the dataset. The attributes of the dataset can be ranked according to their relative importance so that they can be selected as features of the predictive model. My approach uses both supervised and unsupervised learning techniques such as regression or classification and k-means or neural networks, respectively.
##  
## The diabetes dataset has been obtained from data.world. It consists of nine attributes: Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, Body Mass Index (BMI), Diabetes Pedigree Function, Age and Outcome. These attributes are explained below in more detail [10, 11] --
## 1. Pregnancies: This indicates the number of times the female subject has been pregnant.
## 2. Glucose: This indicates the plasma glucose concentration every 2 hours in an oral glucose tolerance test.
## 3. Blood Pressure: This indicates the diastolic blood pressure of the female subject (in mm Hg).
## 4. Skin Thickness: This indicates the triceps skin fold thickness for the female subject (in mm).
## 5. Insulin: This indicates the female subject's serum insulin level every 2 hours (mu U/ml).
## 6. Body Mass Index (BMI): This indicates the body mass index of the female subject (weight in kg/(height in m)^2).
## 7. Diabetes Pedigree Function: This is a measure of the genetic influence/hereditary risk to the onset of diabetes mellitus (pedi).
## 8. Outcome: This indicates whether a female subject was diagnosed with diabetes mellitus or not. A value of 1 indicates that the female subject tested positive, whereas a value of 0 indicates that the female subject tested negative.

METHODOLOGY

## HYPOTHESIS / BASELINE: We are not sure which factors contribute to the onset of diabetes mellitus in patients and hence cannot build predictive models for detecting diabetes in patients.
## This problem can be explored in a number of ways. For the scope of this project, I have split my problem-solving approach into multiple steps, from loading the required packages to comparing the final results. The first step is to install and load the required packages if they are not installed in R Studio. The second step is to load the dataset to study the dimensions, structure, summary and distribution of its variables. The third step is to remove the missing values from the dataset as these can affect our analysis in a negative manner. The fourth step is to study the skewness of the variables of the dataset using histogram analysis. The fifth step is to study the boxplots and density curves of some important variables such as Diabetes Pedigree Function and Plasma Glucose. The sixth step is to examine the correlation of variance for each pair of variables of the dataset and note the pairs that have high correlation. The seventh step is an optional one - I have used other techniques to plot the correlation of variance of the variables graphically. The eigth step is to perform an exploratory analysis on the different age groups using techniques such as scatterplots, lined bar plots, stacked bar plots, box plots and classification pair plots. The ninth step is to perform dimensonality reduction on the variables of the dataset using techniques such as t-distributed stochastic neighbor embedding (t-SNE) and principal component analysis (PCA). The tenth step is to split the dataset into training and testing sets using a suitable training to testing ratio. The eleventh step is to train and test a set of predictive models using supervised and unsupervised machine learning techniques such as classification and clustering, respectively. In this regard, it is important to select the right features for the predictive model. The final step is to improve the models and report the results in order to determine the best machine learning technique. The results include parameters such as model accuracy, sensitivity and specificity. These results are discussed in a later section of this report [12].

RESULTS / OBSERVATIONS

## Below are the results and observations from my analysis --

Step 1: Loading the Packages for the Purpose of Analysis

## Loading all the packages for the purpose of analysis. If a package is not present, the code tries to download and install the package, so please make sure that you are connected to the Internet.
packages_vector <- c("tidyr", "gridExtra", "e1071", "MASS", "PerformanceAnalytics", "pysch", "ggplot2", "GGally", "ggcorrplot", "Rtsne", "ggthemes", "rvest", "factoextra", "graphics", "corrplot", "mclust", "caret", "C50", "stats", "cluster", "matrixStats", "rpart", "rpart.plot", "RWeka", "randomForest", "neuralnet", "kernlab", "party", "class", "gbm", "ada", "TTR", "highcharter", "knitr", "kableExtra")
packages_to_install <- packages_vector[!(packages_vector %in% installed.packages()[,"Package"])]
if(length(packages_to_install)) install.packages(packages_to_install, repos = "http://cran.us.r-project.org")

options("java.home"="/Library/Java/JavaVirtualMachines/jdk-9.0.1.jdk/Contents/Home/lib")
Sys.setenv(LD_LIBRARY_PATH='$JAVA_HOME/server')
dyn.load('/Library/Java/JavaVirtualMachines/jdk-9.0.1.jdk/Contents/Home/lib/server/libjvm.dylib')

library(tidyr)
library(gridExtra)
library(e1071)
library(MASS)
library(PerformanceAnalytics)
library(psych)
library(ggplot2)
library(GGally)
library(ggcorrplot)
library(Rtsne)
library(ggthemes)
library(rvest)
library(factoextra)
library(graphics)
library(corrplot)
library(mclust)
library(caret)
library(C50)
library(stats)
library(cluster)
library(matrixStats)
library(rpart)
library(rpart.plot)
library(RWeka)
library(randomForest)
library(neuralnet)
library(kernlab)
library(party)
library(class)
library(gbm)
library(ada)
library(highcharter)
library(knitr)
library(kableExtra)

Step 2: Loading the Diabetes Dataset and Printing its Properties

## While loading the dataset, the code factors the outcome as either Postive (1) or Negative (0), respectively. It then prints the dimensions of the dataset in terms of the number of rows and columns. It also prints the structure of the dataset before printing the summary, header information and distribution of the outcome.
df <- read.csv("/Users/omkarsunkersett/Downloads/diabetes.csv", header = TRUE, stringsAsFactors = FALSE)
df$Outcome <- as.factor(df$Outcome)
levels(df$Outcome) <- c("Negative","Positive")
dim(df)
## [1] 768   9
str(df)
## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : Factor w/ 2 levels "Negative","Positive": 2 1 2 1 2 1 2 1 2 2 ...
summary(df)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##      Outcome   
##  Negative:500  
##  Positive:268  
##                
##                
##                
## 
head(df)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age  Outcome
## 1                    0.627  50 Positive
## 2                    0.351  31 Negative
## 3                    0.672  32 Positive
## 4                    0.167  21 Negative
## 5                    2.288  33 Positive
## 6                    0.201  30 Negative
prop.table(table(df$Outcome))
## 
##  Negative  Positive 
## 0.6510417 0.3489583

Step 3: Handling the missing values in the dataset

## The below code removes all of the rows that contain a zero value because such rows are not significant for the purpose of analysis. The code replaces the zero values with NAs and drops these rows using the function drop_na(). It then prints the dimensions of the dataset in terms of the number of rows and columns along with the structure of the dataset, summary, header information and distribution of the outcome. We can observe a decrease in the number of rows by about 55%.
df[df == 0] <- NA
df <- df %>% drop_na()
dim(df)
## [1] 336   9
str(df)
## 'data.frame':    336 obs. of  9 variables:
##  $ Pregnancies             : int  1 3 2 1 5 1 1 3 11 10 ...
##  $ Glucose                 : int  89 78 197 189 166 103 115 126 143 125 ...
##  $ BloodPressure           : int  66 50 70 60 72 30 70 88 94 70 ...
##  $ SkinThickness           : int  23 32 45 23 19 38 30 41 33 26 ...
##  $ Insulin                 : int  94 88 543 846 175 83 96 235 146 115 ...
##  $ BMI                     : num  28.1 31 30.5 30.1 25.8 43.3 34.6 39.3 36.6 31.1 ...
##  $ DiabetesPedigreeFunction: num  0.167 0.248 0.158 0.398 0.587 0.183 0.529 0.704 0.254 0.205 ...
##  $ Age                     : int  21 26 53 59 51 33 32 27 51 41 ...
##  $ Outcome                 : Factor w/ 2 levels "Negative","Positive": 1 2 2 2 2 1 2 1 2 2 ...
summary(df)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 1.000   Min.   : 56.0   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.:21.00  
##  Median : 3.000   Median :119.0   Median : 70.00   Median :28.50  
##  Mean   : 3.851   Mean   :122.3   Mean   : 70.24   Mean   :28.66  
##  3rd Qu.: 6.000   3rd Qu.:144.0   3rd Qu.: 78.00   3rd Qu.:36.00  
##  Max.   :17.000   Max.   :197.0   Max.   :110.00   Max.   :52.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   : 15.0   Min.   :18.20   Min.   :0.0850           Min.   :21.00  
##  1st Qu.: 76.0   1st Qu.:27.80   1st Qu.:0.2680           1st Qu.:24.00  
##  Median :125.5   Median :32.75   Median :0.4465           Median :28.00  
##  Mean   :155.3   Mean   :32.30   Mean   :0.5187           Mean   :31.84  
##  3rd Qu.:190.0   3rd Qu.:36.25   3rd Qu.:0.6883           3rd Qu.:38.00  
##  Max.   :846.0   Max.   :57.30   Max.   :2.3290           Max.   :81.00  
##      Outcome   
##  Negative:225  
##  Positive:111  
##                
##                
##                
## 
head(df)
##    Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 4            1      89            66            23      94 28.1
## 7            3      78            50            32      88 31.0
## 9            2     197            70            45     543 30.5
## 14           1     189            60            23     846 30.1
## 15           5     166            72            19     175 25.8
## 19           1     103            30            38      83 43.3
##    DiabetesPedigreeFunction Age  Outcome
## 4                     0.167  21 Negative
## 7                     0.248  26 Positive
## 9                     0.158  53 Positive
## 14                    0.398  59 Positive
## 15                    0.587  51 Positive
## 19                    0.183  33 Negative
prop.table(table(df$Outcome))
## 
##  Negative  Positive 
## 0.6696429 0.3303571

Step 4: Examining the histograms

## The below code generates the histograms for each variable of the dataset using different technqiues. We observe that the Triceps Skin Fold Thickness and Diastolic Blood Pressure variables have a nearly normal distribution, whereas the remaining variables of the dataset are skewed towards to the right.
par(mfrow = c(2, 2))
hist(df$Pregnancies)
hist(df$Glucose)
hist(df$BloodPressure)
hist(df$SkinThickness)

hist(df$Insulin)
hist(df$BMI)
hist(df$DiabetesPedigreeFunction)
hist(df$Age)

ggplot(reshape2::melt(df), aes(x=value, fill=variable)) + geom_histogram(binwidth=5) + facet_wrap(~variable)

grid.arrange(ggplot(df, aes(x=df[,1])) + geom_density() + xlab("Pregnancies"), ggplot(df, aes(x=df[,1], col=Outcome)) + geom_density(alpha=0.4) + xlab("Pregnancies"), ncol=2, top=paste("Pregnancies", " [ Skew:",skewness(df[,1]),"]"))

grid.arrange(ggplot(df, aes(x=df[,2])) + geom_density() + xlab("Glucose"), ggplot(df, aes(x=df[,2], col=Outcome)) + geom_density(alpha=0.4) + xlab("Glucose"), ncol=2, top=paste("Glucose", " [ Skew:",skewness(df[,2]),"]"))

grid.arrange(ggplot(df, aes(x=df[,3])) + geom_density() + xlab("Blood Pressure"), ggplot(df, aes(x=df[,3], col=Outcome)) + geom_density(alpha=0.4) + xlab("Blood Pressure"), ncol=2, top=paste("Blood Pressure", " [ Skew:",skewness(df[,3]),"]"))

grid.arrange(ggplot(df, aes(x=df[,4])) + geom_density() + xlab("Skin Thickness"), ggplot(df, aes(x=df[,4], col=Outcome)) + geom_density(alpha=0.4) + xlab("Skin Thickness"), ncol=2, top=paste("Skin Thickness", " [ Skew:",skewness(df[,4]),"]"))

grid.arrange(ggplot(df, aes(x=df[,5])) + geom_density() + xlab("Insulin"), ggplot(df, aes(x=df[,5], col=Outcome)) + geom_density(alpha=0.4) + xlab("Insulin"), ncol=2, top=paste("Insulin", " [ Skew:",skewness(df[,5]),"]"))

grid.arrange(ggplot(df, aes(x=df[,6])) + geom_density() + xlab("Body Mass Index"), ggplot(df, aes(x=df[,6], col=Outcome)) + geom_density(alpha=0.4) + xlab("Body Mass Index"), ncol=2, top=paste("Body Mass Index", " [ Skew:",skewness(df[,6]),"]"))

grid.arrange(ggplot(df, aes(x=df[,7])) + geom_density() + xlab("Diabetes Pedigree Function"), ggplot(df, aes(x=df[,7], col=Outcome)) + geom_density(alpha=0.4) + xlab("Diabetes Pedigree Function"), ncol=2, top=paste("Diabetes Pedigree Function", " [ Skew:",skewness(df[,7]),"]"))

grid.arrange(ggplot(df, aes(x=df[,8])) + geom_density() + xlab("Age"), ggplot(df, aes(x=df[,8], col=Outcome)) + geom_density(alpha=0.4) + xlab("Age"), ncol=2, top=paste("Age", " [ Skew:",skewness(df[,8]),"]"))

Step 5: Examining some Boxplots and Density Curves

## Figure 1 is a boxplot of the Diabetes Pedigree Function for each Test Result (Positive or Negative). The boxplot indicates that the median value of the pedigree function is higher for the tests that are positive. The inter-quartile range for this function is slightly greater for the tests that are positive.
## Figure 2 is the density curve for the variable Plasma Glucose for both outcomes (positive or negative). The density curve of the negative outcome has a higher peak value than that of the positive outcome. Notice how these density curves are skewed oppositely.
par(mfrow = c(1, 2))
boxplot(DiabetesPedigreeFunction ~ Outcome, data = df, ylab = "Diabetes Pedigree Function", xlab = "Test Results", main = "Figure 1", outline = FALSE)

positive <- subset(df, df$Outcome=='Positive')
negative <- subset(df, df$Outcome=='Negative')
plot(density(positive$Glucose), xlim = c(0, 250), ylim = c(0.00, 0.02), xlab = "Plasma Glucose", main = "Figure 2", col = "red", lwd = 2)
lines(density(negative$Glucose), col = "black", lwd = 2)
legend("topleft", col = c("red", "black"), legend = c("Positive", "Negative"), lwd = 2, bty = "n")

Step 6: Examining the Correlation of Variance for the Dataset

## The below figures depict the correlation of variance using both chart.Correlation() and pairs.panels(). We observe that the correlation is high between the variable pairs Pregnancies & Age, Skin Thickness & BMI, and Glucose & Insulin.
chart.Correlation(df[,-9], histogram=TRUE, col="grey10", pch=1, main="Chart.Correlation of Variance")

pairs.panels(df[,-9], method="pearson", hist.col = "#1fbbfa", density=TRUE, show.points=TRUE, pch=1, lm=TRUE, cex.cor=1, smoother=FALSE, stars=TRUE, main="Pairs.Panels of Variance")

Step 7: Performing Step 6 using other Functions

## The below figures depict the correlation of variance using corrplot(), ggpairs(), ggcorr() and ggcorrplot(). We observe that the correlation is high between the variable pairs Pregnancies & Age, Skin Thickness & BMI, and Glucose & Insulin.
corrplot(cor(df[,-9]))

corrplot(cor(df[,-9]), method = "number", type = "upper",  title = "\nCorrelation Plot of Variance", bg = 0xFF0000, addgrid.col = "darkgray")

ggpairs(df, aes(color=Outcome, alpha=0.80), lower=list(continuous="smooth")) + theme_bw() + labs(title="Correlation Plot of Variance (wrt. Outcome)") + theme(plot.title=element_text(face='bold',color='black',hjust=0.5,size=12))

ggcorr(df[,-9], name = "corr", label = TRUE) + theme(legend.position="none") + labs(title="Correlation Plot of Variance (figure 2)") + theme(plot.title=element_text(face='bold',color='black',hjust=0.5,size=12))

ggcorrplot(round(cor(df[,-9]), 1), hc.order = TRUE, type = "lower", lab = TRUE, lab_size = 3, method="circle", colors = c("red", "green", "blue"), title="Correlation Plot of Variance (figure 3)", ggtheme=theme_bw)

Step 8: Performing some Exploratory Analysis for Age Groups

## Generating some scatterplots, lined bar plots, stacked bar plots, box plots and classification pair plots for the age groups.
ggplot(data=df, aes(Glucose, Pregnancies)) + geom_jitter(aes(colour = Outcome))

ggplot(data=df, aes(Glucose,fill= Outcome)) + geom_bar(color = "black", width = 1) + xlab("Plasma Glucose") + ylab("Number of People") + theme(axis.text.x=element_text(angle=75, hjust=1)) + ggtitle("Plasma Glucose and Test Results")

df_ag <- df
df_ag$AgeGroup <- cut(df_ag$Age, breaks = c(20,35,50,100), labels = FALSE) %>% as.factor()
df_ag$AgeGroup <- as.integer(df_ag$AgeGroup)

ggplot(data=df_ag, aes(AgeGroup, fill = Outcome),y = (..count..)/sum(..count..)) + geom_bar(color = "black", width = 0.7) + xlab("AgeGroup 20-35, 35-50, 50-100") + ylab("Number of People") + theme(axis.text.x=element_text(angle=75, hjust=1)) + ggtitle("Age Group and Test Results") + stat_bin(geom = "text",aes(label = paste(round((..count..)/sum(..count..)*100), "%")),vjust = 2)

df_ag$AgeGroup <- as.factor(df_ag$AgeGroup)
ggplot(data=df_ag, aes(Pregnancies, fill = AgeGroup),y = (..count..)/sum(..count..)) + geom_bar(color = "black", width = 0.7) + xlab("Pregnancies") + ylab("Number of People") + theme(axis.text.x=element_text(angle=75, hjust=1)) + ggtitle("Age Group and Pregnancies")

myplot <- function(x,y) {
  ggplot(data = df_ag, aes(eval(parse(text = x)), eval(parse(text = y))))+geom_boxplot(outlier.colour = "blue") + xlab(x) + ylab(y)  + geom_jitter(alpha=0.2, aes(colour = Outcome))
}
p1 <- myplot("AgeGroup","SkinThickness")
p2 <- myplot("AgeGroup","Pregnancies")
p3 <- myplot("AgeGroup","Glucose")
p4 <- myplot("AgeGroup","BloodPressure")
p5 <- myplot("AgeGroup","BMI")
p6 <- myplot("AgeGroup","Insulin")
grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 3)

clp <- clPairs(df[,-9], classification = df$Outcome, lower.panel = NULL)
clPairsLegend(0.1, 0.4, class = clp$class, col = clp$col, pch = clp$pch, title = "Classification Pairs Plot")

Step 9: Performing Dimensionality Reduction on the Variables

## Using techniques such as t-Distributed Stochastic Neighbor Embedding (t-SNE) and Principal Component Analysis (PCA) to perform dimensionality reduction on the variables of the dataset.
tsne <- function(x) {
  for (i in c(x)) {
    result <- Rtsne(df[,-9], perplexity = x, pca = TRUE, check_duplicates = FALSE)
    return(result$Y)
  }
}

x <- c(10, 20, 30, 40, 50)
df_tsne <- data.frame(tsne(x[1]), tsne(x[2]), tsne(x[3]), tsne(x[4]), tsne(x[5]), class = df$Outcome)

xs <- c(1,3,5,7,9)
ys <- c(2,4,6,8,10)
ggplot(df_tsne,aes(x=(df_tsne[,xs[1]]),y=(df_tsne[,ys[1]]),color=class)) + geom_point(size=1.5, alpha=0.6) + labs(x="", y="") + theme(axis.text.x=element_blank(), axis.text.y=element_blank()) + ggtitle(paste("Perplexity:", x[1])) + scale_color_tableau()

ggplot(df_tsne,aes(x=(df_tsne[,xs[2]]),y=(df_tsne[,ys[2]]),color=class)) + geom_point(size=1.5, alpha=0.6) + labs(x="", y="") + theme(axis.text.x=element_blank(), axis.text.y=element_blank()) + ggtitle(paste("Perplexity:", x[2])) + scale_color_tableau()

ggplot(df_tsne,aes(x=(df_tsne[,xs[3]]),y=(df_tsne[,ys[3]]),color=class)) + geom_point(size=1.5, alpha=0.6) + labs(x="", y="") + theme(axis.text.x=element_blank(), axis.text.y=element_blank()) + ggtitle(paste("Perplexity:", x[3])) + scale_color_tableau()

ggplot(df_tsne,aes(x=(df_tsne[,xs[4]]),y=(df_tsne[,ys[4]]),color=class)) + geom_point(size=1.5, alpha=0.6) + labs(x="", y="") + theme(axis.text.x=element_blank(), axis.text.y=element_blank()) + ggtitle(paste("Perplexity:", x[4])) + scale_color_tableau()

ggplot(df_tsne,aes(x=(df_tsne[,xs[5]]),y=(df_tsne[,ys[5]]),color=class)) + geom_point(size=1.5, alpha=0.6) + labs(x="", y="") + theme(axis.text.x=element_blank(), axis.text.y=element_blank()) + ggtitle(paste("Perplexity:", x[5])) + scale_color_tableau()

mu <- apply(df[,-9], 2, mean)
df.center <- as.matrix(df[,-9] - mean(mu))
S <- cov(df.center)
eigen(S)
## eigen() decomposition
## $values
## [1] 1.446559e+04 6.298215e+02 1.716000e+02 1.024493e+02 7.995919e+01
## [6] 1.901797e+01 5.069121e+00 1.025501e-01
## 
## $vectors
##               [,1]          [,2]         [,3]        [,4]         [,5]
## [1,] -0.0029050033  0.0366779083 -0.093374911  0.03830857  0.174415051
## [2,] -0.1572105456  0.9611908429  0.212947338 -0.03438768 -0.069676519
## [3,] -0.0113281672  0.1518849310 -0.774598045  0.40899636 -0.450862292
## [4,] -0.0177168282  0.0593180072 -0.396029981 -0.81455047  0.034211700
## [5,] -0.9870008329 -0.1596748873 -0.006298551  0.01567225 -0.003049791
## [6,] -0.0132746353  0.0252155158 -0.221791407 -0.35674876 -0.089228226
## [7,] -0.0004917723  0.0001696004 -0.001014924 -0.00303782  0.001045665
## [8,] -0.0220701158  0.1484833852 -0.373979343  0.19762317  0.867355322
##              [,6]         [,7]          [,8]
## [1,]  0.023166040  0.978514939  3.333198e-03
## [2,]  0.002645101 -0.002471398 -6.860792e-05
## [3,] -0.077870570 -0.013450894  9.616938e-04
## [4,] -0.417899796 -0.004374280 -2.335738e-03
## [5,] -0.005673450  0.002519608 -3.964990e-04
## [6,]  0.902536697 -0.013636420 -2.595313e-03
## [7,]  0.001462175 -0.003613076  9.999866e-01
## [8,]  0.064385229 -0.205175511 -1.557695e-03
pca1 <- prcomp(as.matrix(df[,-9]), center = TRUE)
summary(pca1)
## Importance of components:
##                             PC1     PC2      PC3      PC4     PC5     PC6
## Standard deviation     120.2730 25.0962 13.09962 10.12172 8.94199 4.36096
## Proportion of Variance   0.9349  0.0407  0.01109  0.00662 0.00517 0.00123
## Cumulative Proportion    0.9349  0.9756  0.98665  0.99327 0.99844 0.99967
##                            PC7     PC8
## Standard deviation     2.25147 0.32023
## Proportion of Variance 0.00033 0.00001
## Cumulative Proportion  0.99999 1.00000
eigen <- get_eigenvalue(pca1)
eigen
##         eigenvalue variance.percent cumulative.variance.percent
## Dim.1 1.446559e+04     9.348556e+01                    93.48556
## Dim.2 6.298215e+02     4.070295e+00                    97.55585
## Dim.3 1.716000e+02     1.108985e+00                    98.66484
## Dim.4 1.024493e+02     6.620907e-01                    99.32693
## Dim.5 7.995919e+01     5.167456e-01                    99.84367
## Dim.6 1.901797e+01     1.229058e-01                    99.96658
## Dim.7 5.069121e+00     3.275979e-02                    99.99934
## Dim.8 1.025501e-01     6.627417e-04                   100.00000
plot(pca1)

qualit_vars <- as.factor(df$Outcome)
biplot(pca1, choices = 1:2, scale = 1, pc.biplot = FALSE)

fviz_pca_biplot(pca1, axes = c(1, 2), geom = c("point", "text"), col.ind = "black", col.var = "steelblue", label = "all", invisible = "none", repel = T, habillage = qualit_vars, palette = NULL, addEllipses = TRUE, title = "PCA - Biplot")

Step 10: Dividing the Original Dataset into Training and Testing Sets

## The training dataset contains 70% of the rows from the original dataset, whereas the testing dataset contains the remaining 30%. It is important to check the dimension and distribution of the resulting datasets.
set.seed(12345)
ckpt <- sample(1:nrow(df), floor(0.70*nrow(df)))
train <- df[ckpt,]
test <- df[-ckpt,]
dim(train)
## [1] 235   9
dim(test)
## [1] 101   9
prop.table(table(train$Outcome))
## 
##  Negative  Positive 
## 0.6893617 0.3106383
prop.table(table(test$Outcome))
## 
##  Negative  Positive 
## 0.6237624 0.3762376

Step 11: Training and Testing a Set of Models using Supervised and Unsupervised Machine Learning Techniques

## IMPORTANT NOTE: I have selected all of the attributes (except for Outcome) as features of all of my predictive models. This is because all of the attributes have some effect on the Outcome (Test Results for Diabetes Mellitus). Hence, feature selection has been performed by selecting all of the attributes of the dataset. Moreover, I have used the confusion matrix for calculating the accuracy, sensitivity and specificity of my models. These three are the metrics of evaluation for my results.

Using the C5.0 Classification Model –

set.seed(1234)
c5_model <- C5.0(train[,-9], train$Outcome)
c5_model
## 
## Call:
## C5.0.default(x = train[, -9], y = train$Outcome)
## 
## Classification Tree
## Number of samples: 235 
## Number of predictors: 8 
## 
## Tree size: 16 
## 
## Non-standard options: attempt to group attributes
summary(c5_model)
## 
## Call:
## C5.0.default(x = train[, -9], y = train$Outcome)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Fri May 11 20:09:25 2018
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 235 cases (9 attributes) from undefined.data
## 
## Decision tree:
## 
## Glucose <= 127:
## :...DiabetesPedigreeFunction <= 0.673: Negative (107/5)
## :   DiabetesPedigreeFunction > 0.673:
## :   :...Age <= 43: Negative (33/8)
## :       Age > 43: Positive (5)
## Glucose > 127:
## :...Age <= 24:
##     :...BMI <= 38.7: Negative (13)
##     :   BMI > 38.7:
##     :   :...Insulin <= 335: Positive (2)
##     :       Insulin > 335: Negative (2)
##     Age > 24:
##     :...Glucose > 154:
##         :...Insulin <= 83: Negative (2)
##         :   Insulin > 83: Positive (31/2)
##         Glucose <= 154:
##         :...Glucose > 152: Negative (5)
##             Glucose <= 152:
##             :...Age > 55: Negative (3)
##                 Age <= 55:
##                 :...BloodPressure > 76: Positive (18/1)
##                     BloodPressure <= 76:
##                     :...DiabetesPedigreeFunction > 0.598: Negative (3)
##                         DiabetesPedigreeFunction <= 0.598:
##                         :...DiabetesPedigreeFunction > 0.415: Positive (5)
##                             DiabetesPedigreeFunction <= 0.415:
##                             :...BloodPressure <= 68: Negative (2)
##                                 BloodPressure > 68:
##                                 :...BloodPressure <= 70: Positive (2)
##                                     BloodPressure > 70: Negative (2)
## 
## 
## Evaluation on training data (235 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      16   16( 6.8%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     159     3    (a): class Negative
##      13    60    (b): class Positive
## 
## 
##  Attribute usage:
## 
##  100.00% Glucose
##   67.66% DiabetesPedigreeFunction
##   54.47% Age
##   15.74% Insulin
##   13.62% BloodPressure
##    7.23% BMI
## 
## 
## Time: 0.0 secs
plot(c5_model, subtree = 2)

c5_pred <- predict(c5_model, test[,-9])
cm_c5_orig <- confusionMatrix(table(c5_pred, test$Outcome))
cm_c5_orig
## Confusion Matrix and Statistics
## 
##           
## c5_pred    Negative Positive
##   Negative       55       21
##   Positive        8       17
##                                           
##                Accuracy : 0.7129          
##                  95% CI : (0.6143, 0.7985)
##     No Information Rate : 0.6238          
##     P-Value [Acc > NIR] : 0.03858         
##                                           
##                   Kappa : 0.3437          
##  Mcnemar's Test P-Value : 0.02586         
##                                           
##             Sensitivity : 0.8730          
##             Specificity : 0.4474          
##          Pos Pred Value : 0.7237          
##          Neg Pred Value : 0.6800          
##              Prevalence : 0.6238          
##          Detection Rate : 0.5446          
##    Detection Prevalence : 0.7525          
##       Balanced Accuracy : 0.6602          
##                                           
##        'Positive' Class : Negative        
## 
set.seed(1234)
c5_boost <- C5.0(train[,-9], train$Outcome, trials = 6)
c5_boost
## 
## Call:
## C5.0.default(x = train[, -9], y = train$Outcome, trials = 6)
## 
## Classification Tree
## Number of samples: 235 
## Number of predictors: 8 
## 
## Number of boosting iterations: 6 
## Average tree size: 13.3 
## 
## Non-standard options: attempt to group attributes
summary(c5_boost)
## 
## Call:
## C5.0.default(x = train[, -9], y = train$Outcome, trials = 6)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Fri May 11 20:09:26 2018
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 235 cases (9 attributes) from undefined.data
## 
## -----  Trial 0:  -----
## 
## Decision tree:
## 
## Glucose <= 127:
## :...DiabetesPedigreeFunction <= 0.673: Negative (107/5)
## :   DiabetesPedigreeFunction > 0.673:
## :   :...Age <= 43: Negative (33/8)
## :       Age > 43: Positive (5)
## Glucose > 127:
## :...Age <= 24:
##     :...BMI <= 38.7: Negative (13)
##     :   BMI > 38.7:
##     :   :...Insulin <= 335: Positive (2)
##     :       Insulin > 335: Negative (2)
##     Age > 24:
##     :...Glucose > 154:
##         :...Insulin <= 83: Negative (2)
##         :   Insulin > 83: Positive (31/2)
##         Glucose <= 154:
##         :...Glucose > 152: Negative (5)
##             Glucose <= 152:
##             :...Age > 55: Negative (3)
##                 Age <= 55:
##                 :...BloodPressure > 76: Positive (18/1)
##                     BloodPressure <= 76:
##                     :...DiabetesPedigreeFunction > 0.598: Negative (3)
##                         DiabetesPedigreeFunction <= 0.598:
##                         :...DiabetesPedigreeFunction > 0.415: Positive (5)
##                             DiabetesPedigreeFunction <= 0.415:
##                             :...BloodPressure <= 68: Negative (2)
##                                 BloodPressure > 68:
##                                 :...BloodPressure <= 70: Positive (2)
##                                     BloodPressure > 70: Negative (2)
## 
## -----  Trial 1:  -----
## 
## Decision tree:
## 
## BMI <= 26.3: Negative (40.7/2.3)
## BMI > 26.3:
## :...Insulin <= 68: Negative (25.4/1.5)
##     Insulin > 68:
##     :...Age <= 22: Negative (12.3)
##         Age > 22:
##         :...DiabetesPedigreeFunction > 0.528: Positive (78.3/16.1)
##             DiabetesPedigreeFunction <= 0.528:
##             :...BMI > 43.5: Positive (9.9)
##                 BMI <= 43.5:
##                 :...Pregnancies > 10: Positive (6.5)
##                     Pregnancies <= 10:
##                     :...Glucose <= 81: Positive (4.2)
##                         Glucose > 81:
##                         :...Glucose <= 127: Negative (23)
##                             Glucose > 127:
##                             :...DiabetesPedigreeFunction <= 0.306: Negative (22.5/4.6)
##                                 DiabetesPedigreeFunction > 0.306: Positive (12.3/3.1)
## 
## -----  Trial 2:  -----
## 
## Decision tree:
## 
## Glucose > 157:
## :...SkinThickness <= 17: Negative (5.1/0.6)
## :   SkinThickness > 17: Positive (26.1/1.2)
## Glucose <= 157:
## :...BMI <= 26.3: Negative (29.5)
##     BMI > 26.3:
##     :...Age <= 30:
##         :...SkinThickness <= 30: Negative (51.1/3.8)
##         :   SkinThickness > 30:
##         :   :...DiabetesPedigreeFunction > 0.893: Positive (10.4/0.6)
##         :       DiabetesPedigreeFunction <= 0.893:
##         :       :...Glucose > 119: Negative (25.4/2.4)
##         :           Glucose <= 119:
##         :           :...Insulin <= 82: Negative (4.8)
##         :               Insulin > 82: Positive (16.1/3)
##         Age > 30:
##         :...Pregnancies <= 1: Positive (12.4/0.6)
##             Pregnancies > 1:
##             :...Glucose <= 90: Negative (5)
##                 Glucose > 90:
##                 :...SkinThickness <= 26: Positive (14.6/1.2)
##                     SkinThickness > 26:
##                     :...BloodPressure <= 74: Negative (17.5/1.8)
##                         BloodPressure > 74: Positive (16.9/4.4)
## 
## -----  Trial 3:  -----
## 
## Decision tree:
## 
## Age <= 22: Negative (19.6)
## Age > 22:
## :...Insulin <= 87:
##     :...DiabetesPedigreeFunction <= 1.268: Negative (39.5/2)
##     :   DiabetesPedigreeFunction > 1.268: Positive (4.1)
##     Insulin > 87:
##     :...Glucose > 154: Positive (35.1/6.4)
##         Glucose <= 154:
##         :...Glucose > 152: Negative (12.8)
##             Glucose <= 152:
##             :...BMI > 39.7: Positive (15.7/0.9)
##                 BMI <= 39.7:
##                 :...Pregnancies > 10: Positive (7.2/0.5)
##                     Pregnancies <= 10:
##                     :...BloodPressure <= 50: Positive (8.5/0.9)
##                         BloodPressure > 50: Negative (92.5/31.8)
## 
## -----  Trial 4:  -----
## 
## Decision tree:
## 
## BMI <= 25: Negative (14.6)
## BMI > 25:
## :...Age <= 24: Negative (42/8)
##     Age > 24:
##     :...Insulin <= 86: Negative (29.2/7.7)
##         Insulin > 86:
##         :...DiabetesPedigreeFunction <= 0.229: Negative (19.9/4.4)
##             DiabetesPedigreeFunction > 0.229:
##             :...SkinThickness > 45: Positive (14.7)
##                 SkinThickness <= 45:
##                 :...Insulin <= 100: Positive (13.7/0.4)
##                     Insulin > 100:
##                     :...Insulin > 155:
##                         :...SkinThickness <= 41: Positive (51.9/10)
##                         :   SkinThickness > 41: Negative (3.8)
##                         Insulin <= 155:
##                         :...Glucose <= 124: Negative (10.4/0.4)
##                             Glucose > 124:
##                             :...Pregnancies <= 2: Positive (7.9)
##                                 Pregnancies > 2:
##                                 :...SkinThickness <= 28: Positive (15/5.3)
##                                     SkinThickness > 28: Negative (11.8/1.5)
## 
## -----  Trial 5:  -----
## 
## Decision tree:
## 
## Age <= 22: Negative (12.6)
## Age > 22:
## :...BloodPressure > 88: Positive (14.3/0.6)
##     BloodPressure <= 88:
##     :...Insulin <= 68: Negative (15)
##         Insulin > 68:
##         :...SkinThickness <= 30:
##             :...Age <= 26: Negative (25/2.4)
##             :   Age > 26:
##             :   :...Glucose > 181: Negative (5.3/0.3)
##             :       Glucose <= 181:
##             :       :...SkinThickness > 25: Negative (22.2/4.7)
##             :           SkinThickness <= 25:
##             :           :...BloodPressure > 82: Negative (2.6)
##             :               BloodPressure <= 82:
##             :               :...BMI <= 25.4: Negative (3.5)
##             :                   BMI > 25.4: Positive (27.8/2.8)
##             SkinThickness > 30:
##             :...Glucose > 157: Positive (14)
##                 Glucose <= 157:
##                 :...Glucose > 152: Negative (10.3)
##                     Glucose <= 152:
##                     :...BloodPressure > 78: Positive (22.2/1.8)
##                         BloodPressure <= 78:
##                         :...Age > 57: Negative (4.1)
##                             Age <= 57:
##                             :...Pregnancies > 7: Positive (4.8)
##                                 Pregnancies <= 7: [S1]
## 
## SubTree [S1]
## 
## DiabetesPedigreeFunction <= 0.332: Negative (11.6/1.6)
## DiabetesPedigreeFunction > 0.332:
## :...Age > 42: Positive (5.6)
##     Age <= 42:
##     :...Age > 32: Negative (3.6)
##         Age <= 32:
##         :...SkinThickness > 45: Negative (2.5)
##             SkinThickness <= 45:
##             :...BMI <= 41.3: Positive (24.7/4.6)
##                 BMI > 41.3: Negative (3.3/0.3)
## 
## 
## Evaluation on training data (235 cases):
## 
## Trial        Decision Tree   
## -----      ----------------  
##    Size      Errors  
## 
##    0     16   16( 6.8%)
##    1     10   36(15.3%)
##    2     13   28(11.9%)
##    3      9   32(13.6%)
##    4     12   30(12.8%)
##    5     20   28(11.9%)
## boost              1( 0.4%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     162          (a): class Negative
##       1    72    (b): class Positive
## 
## 
##  Attribute usage:
## 
##  100.00% Glucose
##  100.00% BMI
##  100.00% Age
##   96.60% DiabetesPedigreeFunction
##   92.77% Insulin
##   85.96% SkinThickness
##   83.83% BloodPressure
##   60.00% Pregnancies
## 
## 
## Time: 0.0 secs
plot(c5_boost, subtree = 2)

c5_boost_pred <- predict(c5_boost, test[,-9])
cm_c5_boost <- confusionMatrix(table(c5_boost_pred, test$Outcome))
cm_c5_boost
## Confusion Matrix and Statistics
## 
##              
## c5_boost_pred Negative Positive
##      Negative       52       12
##      Positive       11       26
##                                           
##                Accuracy : 0.7723          
##                  95% CI : (0.6782, 0.8498)
##     No Information Rate : 0.6238          
##     P-Value [Acc > NIR] : 0.001057        
##                                           
##                   Kappa : 0.5123          
##  Mcnemar's Test P-Value : 1.000000        
##                                           
##             Sensitivity : 0.8254          
##             Specificity : 0.6842          
##          Pos Pred Value : 0.8125          
##          Neg Pred Value : 0.7027          
##              Prevalence : 0.6238          
##          Detection Rate : 0.5149          
##    Detection Prevalence : 0.6337          
##       Balanced Accuracy : 0.7548          
##                                           
##        'Positive' Class : Negative        
## 

Using the Recursive Partitioning (R-PART) Model –

set.seed(1234)
rp_model <- rpart(Outcome~., data=train, cp=0.01)
rp_model
## n= 235 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 235 73 Negative (0.68936170 0.31063830)  
##    2) Glucose< 127.5 145 18 Negative (0.87586207 0.12413793)  
##      4) DiabetesPedigreeFunction< 0.6735 107  5 Negative (0.95327103 0.04672897) *
##      5) DiabetesPedigreeFunction>=0.6735 38 13 Negative (0.65789474 0.34210526)  
##       10) Age< 40 31  7 Negative (0.77419355 0.22580645) *
##       11) Age>=40 7  1 Positive (0.14285714 0.85714286) *
##    3) Glucose>=127.5 90 35 Positive (0.38888889 0.61111111)  
##      6) Age< 24.5 17  2 Negative (0.88235294 0.11764706) *
##      7) Age>=24.5 73 20 Positive (0.27397260 0.72602740)  
##       14) Glucose< 154.5 40 16 Positive (0.40000000 0.60000000)  
##         28) BloodPressure< 77 18  7 Negative (0.61111111 0.38888889) *
##         29) BloodPressure>=77 22  5 Positive (0.22727273 0.77272727) *
##       15) Glucose>=154.5 33  4 Positive (0.12121212 0.87878788) *
summary(rp_model)
## Call:
## rpart(formula = Outcome ~ ., data = train, cp = 0.01)
##   n= 235 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.27397260      0 1.0000000 1.0000000 0.09717670
## 2 0.17808219      1 0.7260274 0.8219178 0.09156676
## 3 0.03424658      2 0.5479452 0.5479452 0.07892062
## 4 0.02739726      4 0.4794521 0.6438356 0.08399841
## 5 0.01000000      6 0.4246575 0.7260274 0.08776409
## 
## Variable importance
##                  Glucose                      Age              Pregnancies 
##                       32                       23                       12 
##                  Insulin            BloodPressure            SkinThickness 
##                       11                        9                        6 
## DiabetesPedigreeFunction                      BMI 
##                        5                        1 
## 
## Node number 1: 235 observations,    complexity param=0.2739726
##   predicted class=Negative  expected loss=0.3106383  P(node) =1
##     class counts:   162    73
##    probabilities: 0.689 0.311 
##   left son=2 (145 obs) right son=3 (90 obs)
##   Primary splits:
##       Glucose     < 127.5  to the left,  improve=26.338000, (0 missing)
##       Age         < 28.5   to the left,  improve=21.612980, (0 missing)
##       Insulin     < 119.5  to the left,  improve=17.115850, (0 missing)
##       Pregnancies < 6.5    to the left,  improve=12.205670, (0 missing)
##       BMI         < 26.45  to the left,  improve= 8.832585, (0 missing)
##   Surrogate splits:
##       Insulin       < 123.5  to the left,  agree=0.728, adj=0.289, (0 split)
##       Age           < 33.5   to the left,  agree=0.694, adj=0.200, (0 split)
##       Pregnancies   < 6.5    to the left,  agree=0.689, adj=0.189, (0 split)
##       BloodPressure < 77     to the left,  agree=0.668, adj=0.133, (0 split)
##       SkinThickness < 32.5   to the left,  agree=0.660, adj=0.111, (0 split)
## 
## Node number 2: 145 observations,    complexity param=0.03424658
##   predicted class=Negative  expected loss=0.1241379  P(node) =0.6170213
##     class counts:   127    18
##    probabilities: 0.876 0.124 
##   left son=4 (107 obs) right son=5 (38 obs)
##   Primary splits:
##       DiabetesPedigreeFunction < 0.6735 to the left,  improve=4.893061, (0 missing)
##       Age                      < 43.5   to the left,  improve=3.696448, (0 missing)
##       Insulin                  < 145    to the left,  improve=3.336229, (0 missing)
##       BMI                      < 40.7   to the left,  improve=3.034738, (0 missing)
##       Pregnancies              < 7.5    to the left,  improve=1.968943, (0 missing)
##   Surrogate splits:
##       Insulin       < 203.5  to the left,  agree=0.759, adj=0.079, (0 split)
##       BMI           < 40.1   to the left,  agree=0.759, adj=0.079, (0 split)
##       SkinThickness < 9      to the right, agree=0.752, adj=0.053, (0 split)
## 
## Node number 3: 90 observations,    complexity param=0.1780822
##   predicted class=Positive  expected loss=0.3888889  P(node) =0.3829787
##     class counts:    35    55
##    probabilities: 0.389 0.611 
##   left son=6 (17 obs) right son=7 (73 obs)
##   Primary splits:
##       Age           < 24.5   to the left,  improve=10.207270, (0 missing)
##       Glucose       < 154.5  to the left,  improve= 5.011332, (0 missing)
##       BMI           < 29.5   to the left,  improve= 4.283294, (0 missing)
##       Pregnancies   < 1.5    to the left,  improve= 2.793894, (0 missing)
##       SkinThickness < 22.5   to the left,  improve= 2.793894, (0 missing)
##   Surrogate splits:
##       Pregnancies   < 1.5    to the left,  agree=0.867, adj=0.294, (0 split)
##       BloodPressure < 49     to the left,  agree=0.833, adj=0.118, (0 split)
##       SkinThickness < 13.5   to the left,  agree=0.833, adj=0.118, (0 split)
## 
## Node number 4: 107 observations
##   predicted class=Negative  expected loss=0.04672897  P(node) =0.4553191
##     class counts:   102     5
##    probabilities: 0.953 0.047 
## 
## Node number 5: 38 observations,    complexity param=0.03424658
##   predicted class=Negative  expected loss=0.3421053  P(node) =0.1617021
##     class counts:    25    13
##    probabilities: 0.658 0.342 
##   left son=10 (31 obs) right son=11 (7 obs)
##   Primary splits:
##       Age           < 40     to the left,  improve=4.552268, (0 missing)
##       Pregnancies   < 3.5    to the left,  improve=3.296568, (0 missing)
##       BloodPressure < 75     to the left,  improve=2.377153, (0 missing)
##       Insulin       < 140    to the left,  improve=2.158484, (0 missing)
##       Glucose       < 110.5  to the left,  improve=1.523725, (0 missing)
##   Surrogate splits:
##       Pregnancies < 7.5    to the left,  agree=0.868, adj=0.286, (0 split)
## 
## Node number 6: 17 observations
##   predicted class=Negative  expected loss=0.1176471  P(node) =0.07234043
##     class counts:    15     2
##    probabilities: 0.882 0.118 
## 
## Node number 7: 73 observations,    complexity param=0.02739726
##   predicted class=Positive  expected loss=0.2739726  P(node) =0.3106383
##     class counts:    20    53
##    probabilities: 0.274 0.726 
##   left son=14 (40 obs) right son=15 (33 obs)
##   Primary splits:
##       Glucose                  < 154.5  to the left,  improve=2.810793, (0 missing)
##       Insulin                  < 142    to the left,  improve=2.586832, (0 missing)
##       DiabetesPedigreeFunction < 0.3425 to the left,  improve=1.525798, (0 missing)
##       BMI                      < 29.5   to the left,  improve=1.402015, (0 missing)
##       SkinThickness            < 44     to the left,  improve=1.162308, (0 missing)
##   Surrogate splits:
##       Insulin       < 238.5  to the left,  agree=0.685, adj=0.303, (0 split)
##       BloodPressure < 71     to the right, agree=0.658, adj=0.242, (0 split)
##       Pregnancies   < 3.5    to the right, agree=0.630, adj=0.182, (0 split)
##       SkinThickness < 20     to the right, agree=0.616, adj=0.152, (0 split)
##       BMI           < 25.85  to the right, agree=0.575, adj=0.061, (0 split)
## 
## Node number 10: 31 observations
##   predicted class=Negative  expected loss=0.2258065  P(node) =0.1319149
##     class counts:    24     7
##    probabilities: 0.774 0.226 
## 
## Node number 11: 7 observations
##   predicted class=Positive  expected loss=0.1428571  P(node) =0.02978723
##     class counts:     1     6
##    probabilities: 0.143 0.857 
## 
## Node number 14: 40 observations,    complexity param=0.02739726
##   predicted class=Positive  expected loss=0.4  P(node) =0.1702128
##     class counts:    16    24
##    probabilities: 0.400 0.600 
##   left son=28 (18 obs) right son=29 (22 obs)
##   Primary splits:
##       BloodPressure < 77     to the left,  improve=2.917172, (0 missing)
##       Pregnancies   < 3.5    to the right, improve=2.899060, (0 missing)
##       Glucose       < 130.5  to the right, improve=2.715152, (0 missing)
##       BMI           < 31.45  to the left,  improve=2.540659, (0 missing)
##       SkinThickness < 31.5   to the left,  improve=1.786895, (0 missing)
##   Surrogate splits:
##       SkinThickness < 33.5   to the left,  agree=0.675, adj=0.278, (0 split)
##       Age           < 30     to the left,  agree=0.675, adj=0.278, (0 split)
##       Pregnancies   < 7.5    to the left,  agree=0.650, adj=0.222, (0 split)
##       Insulin       < 186    to the right, agree=0.650, adj=0.222, (0 split)
##       BMI           < 36.25  to the left,  agree=0.650, adj=0.222, (0 split)
## 
## Node number 15: 33 observations
##   predicted class=Positive  expected loss=0.1212121  P(node) =0.1404255
##     class counts:     4    29
##    probabilities: 0.121 0.879 
## 
## Node number 28: 18 observations
##   predicted class=Negative  expected loss=0.3888889  P(node) =0.07659574
##     class counts:    11     7
##    probabilities: 0.611 0.389 
## 
## Node number 29: 22 observations
##   predicted class=Positive  expected loss=0.2272727  P(node) =0.09361702
##     class counts:     5    17
##    probabilities: 0.227 0.773
rpart.plot(rp_model, type = 4, extra = 1, clip.right.labs = FALSE)

rp_pred <- predict(rp_model, test, type = 'class')
cm_rp_orig <- confusionMatrix(table(rp_pred, test$Outcome))
cm_rp_orig
## Confusion Matrix and Statistics
## 
##           
## rp_pred    Negative Positive
##   Negative       55       20
##   Positive        8       18
##                                           
##                Accuracy : 0.7228          
##                  95% CI : (0.6248, 0.8072)
##     No Information Rate : 0.6238          
##     P-Value [Acc > NIR] : 0.02377         
##                                           
##                   Kappa : 0.3699          
##  Mcnemar's Test P-Value : 0.03764         
##                                           
##             Sensitivity : 0.8730          
##             Specificity : 0.4737          
##          Pos Pred Value : 0.7333          
##          Neg Pred Value : 0.6923          
##              Prevalence : 0.6238          
##          Detection Rate : 0.5446          
##    Detection Prevalence : 0.7426          
##       Balanced Accuracy : 0.6734          
##                                           
##        'Positive' Class : Negative        
## 
set.seed(1234)
control <- rpart.control(cp = 0.000, xxval = 100, minsplit = 2)
rp_model <- rpart(Outcome~., data = train, control = control)
plotcp(rp_model)

printcp(rp_model)
## 
## Classification tree:
## rpart(formula = Outcome ~ ., data = train, control = control)
## 
## Variables actually used in tree construction:
## [1] Age                      BloodPressure           
## [3] BMI                      DiabetesPedigreeFunction
## [5] Glucose                  Insulin                 
## [7] Pregnancies              SkinThickness           
## 
## Root node error: 73/235 = 0.31064
## 
## n= 235 
## 
##          CP nsplit rel error  xerror     xstd
## 1 0.2739726      0  1.000000 1.00000 0.097177
## 2 0.1780822      1  0.726027 0.82192 0.091567
## 3 0.0342466      2  0.547945 0.54795 0.078921
## 4 0.0273973      6  0.410959 0.58904 0.081195
## 5 0.0182648     11  0.273973 0.61644 0.082628
## 6 0.0136986     14  0.219178 0.65753 0.084661
## 7 0.0068493     26  0.054795 0.64384 0.083998
## 8 0.0000000     34  0.000000 0.65753 0.084661
set.seed(1234)
selected_tr <- prune(rp_model, cp = rp_model$cptable[which.min(rp_model$cptable[,"xerror"]), "CP"])
rpart.plot(selected_tr, type = 4, extra = 1, clip.right.labs = FALSE)

rp_pred_tune <- predict(selected_tr, test, type = 'class')
cm_rp_tune <- confusionMatrix(table(rp_pred_tune, test$Outcome))
cm_rp_tune
## Confusion Matrix and Statistics
## 
##             
## rp_pred_tune Negative Positive
##     Negative       53       16
##     Positive       10       22
##                                          
##                Accuracy : 0.7426         
##                  95% CI : (0.646, 0.8244)
##     No Information Rate : 0.6238         
##     P-Value [Acc > NIR] : 0.007895       
##                                          
##                   Kappa : 0.4338         
##  Mcnemar's Test P-Value : 0.326800       
##                                          
##             Sensitivity : 0.8413         
##             Specificity : 0.5789         
##          Pos Pred Value : 0.7681         
##          Neg Pred Value : 0.6875         
##              Prevalence : 0.6238         
##          Detection Rate : 0.5248         
##    Detection Prevalence : 0.6832         
##       Balanced Accuracy : 0.7101         
##                                          
##        'Positive' Class : Negative       
## 

Using the One Rule Classification Model –

set.seed(1234)
oneR_model <- OneR(Outcome~., data = train)
oneR_model
## Glucose:
##  < 127.5 -> Negative
##  < 129.5 -> Positive
##  < 143.5 -> Negative
##  < 149.0 -> Positive
##  < 154.5 -> Negative
##  >= 154.5    -> Positive
## (193/235 instances correct)
summary(oneR_model)
## 
## === Summary ===
## 
## Correctly Classified Instances         193               82.1277 %
## Incorrectly Classified Instances        42               17.8723 %
## Kappa statistic                          0.5524
## Mean absolute error                      0.1787
## Root mean squared error                  0.4228
## Relative absolute error                 41.6712 %
## Root relative squared error             91.356  %
## Total Number of Instances              235     
## 
## === Confusion Matrix ===
## 
##    a   b   <-- classified as
##  150  12 |   a = Negative
##   30  43 |   b = Positive
oneR_pred <- predict(oneR_model, test, type = 'class')
cm_oneR_orig <- confusionMatrix(oneR_pred, test$Outcome)
cm_oneR_orig
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Negative Positive
##   Negative       53       19
##   Positive       10       19
##                                           
##                Accuracy : 0.7129          
##                  95% CI : (0.6143, 0.7985)
##     No Information Rate : 0.6238          
##     P-Value [Acc > NIR] : 0.03858         
##                                           
##                   Kappa : 0.3581          
##  Mcnemar's Test P-Value : 0.13739         
##                                           
##             Sensitivity : 0.8413          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.7361          
##          Neg Pred Value : 0.6552          
##              Prevalence : 0.6238          
##          Detection Rate : 0.5248          
##    Detection Prevalence : 0.7129          
##       Balanced Accuracy : 0.6706          
##                                           
##        'Positive' Class : Negative        
## 

Using the JRip Rule Learning Model –

set.seed(1234)
jrip_model <- JRip(Outcome~., data = train)
jrip_model
## JRIP rules:
## ===========
## 
## (Glucose >= 128) and (Age >= 25) => Outcome=Positive (73.0/20.0)
## (Age >= 41) and (DiabetesPedigreeFunction >= 0.412) => Outcome=Positive (9.0/2.0)
##  => Outcome=Negative (153.0/13.0)
## 
## Number of Rules : 3
summary(jrip_model)
## 
## === Summary ===
## 
## Correctly Classified Instances         200               85.1064 %
## Incorrectly Classified Instances        35               14.8936 %
## Kappa statistic                          0.6636
## Mean absolute error                      0.2381
## Root mean squared error                  0.345 
## Relative absolute error                 55.5051 %
## Root relative squared error             74.5539 %
## Total Number of Instances              235     
## 
## === Confusion Matrix ===
## 
##    a   b   <-- classified as
##  140  22 |   a = Negative
##   13  60 |   b = Positive
jrip_pred <- predict(jrip_model, test, type = 'class')
cm_jrip_orig <- confusionMatrix(jrip_pred, test$Outcome)
cm_jrip_orig
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Negative Positive
##   Negative       48       15
##   Positive       15       23
##                                           
##                Accuracy : 0.703           
##                  95% CI : (0.6039, 0.7898)
##     No Information Rate : 0.6238          
##     P-Value [Acc > NIR] : 0.06002         
##                                           
##                   Kappa : 0.3672          
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##             Sensitivity : 0.7619          
##             Specificity : 0.6053          
##          Pos Pred Value : 0.7619          
##          Neg Pred Value : 0.6053          
##              Prevalence : 0.6238          
##          Detection Rate : 0.4752          
##    Detection Prevalence : 0.6238          
##       Balanced Accuracy : 0.6836          
##                                           
##        'Positive' Class : Negative        
## 

Using the Naive Bayes Model (with and without Laplace Smoothing; Laplace Parameter = 50) –

set.seed(1234)
nb_model <- naiveBayes(train, train$Outcome)
nb_model
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = train, y = train$Outcome)
## 
## A-priori probabilities:
## train$Outcome
##  Negative  Positive 
## 0.6893617 0.3106383 
## 
## Conditional probabilities:
##              Pregnancies
## train$Outcome     [,1]     [,2]
##      Negative 3.067901 2.482552
##      Positive 5.479452 3.869789
## 
##              Glucose
## train$Outcome     [,1]     [,2]
##      Negative 110.3333 25.43705
##      Positive 144.3288 28.50539
## 
##              BloodPressure
## train$Outcome     [,1]     [,2]
##      Negative 67.35802 11.68405
##      Positive 74.19178 12.52182
## 
##              SkinThickness
## train$Outcome     [,1]     [,2]
##      Negative 26.57407 9.945695
##      Positive 32.75342 9.352333
## 
##              Insulin
## train$Outcome     [,1]      [,2]
##      Negative 122.3025  87.12696
##      Positive 200.9041 120.79450
## 
##              BMI
## train$Outcome     [,1]     [,2]
##      Negative 30.86605 6.205399
##      Positive 34.83425 5.665348
## 
##              DiabetesPedigreeFunction
## train$Outcome      [,1]      [,2]
##      Negative 0.4670185 0.2952471
##      Positive 0.6424521 0.3718617
## 
##              Age
## train$Outcome     [,1]      [,2]
##      Negative 28.46914  9.391534
##      Positive 37.71233 10.123514
## 
##              Outcome
## train$Outcome Negative Positive
##      Negative        1        0
##      Positive        0        1
nb_pred <- predict(nb_model, test)
cm_nb_orig <- confusionMatrix(table(nb_pred, test$Outcome))
cm_nb_orig
## Confusion Matrix and Statistics
## 
##           
## nb_pred    Negative Positive
##   Negative       61        1
##   Positive        2       37
##                                           
##                Accuracy : 0.9703          
##                  95% CI : (0.9156, 0.9938)
##     No Information Rate : 0.6238          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.937           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9683          
##             Specificity : 0.9737          
##          Pos Pred Value : 0.9839          
##          Neg Pred Value : 0.9487          
##              Prevalence : 0.6238          
##          Detection Rate : 0.6040          
##    Detection Prevalence : 0.6139          
##       Balanced Accuracy : 0.9710          
##                                           
##        'Positive' Class : Negative        
## 
set.seed(1234)
nb_lap_model <- naiveBayes(train, train$Outcome, laplace = 50)
nb_lap_model
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = train, y = train$Outcome, laplace = 50)
## 
## A-priori probabilities:
## train$Outcome
##  Negative  Positive 
## 0.6893617 0.3106383 
## 
## Conditional probabilities:
##              Pregnancies
## train$Outcome     [,1]     [,2]
##      Negative 3.067901 2.482552
##      Positive 5.479452 3.869789
## 
##              Glucose
## train$Outcome     [,1]     [,2]
##      Negative 110.3333 25.43705
##      Positive 144.3288 28.50539
## 
##              BloodPressure
## train$Outcome     [,1]     [,2]
##      Negative 67.35802 11.68405
##      Positive 74.19178 12.52182
## 
##              SkinThickness
## train$Outcome     [,1]     [,2]
##      Negative 26.57407 9.945695
##      Positive 32.75342 9.352333
## 
##              Insulin
## train$Outcome     [,1]      [,2]
##      Negative 122.3025  87.12696
##      Positive 200.9041 120.79450
## 
##              BMI
## train$Outcome     [,1]     [,2]
##      Negative 30.86605 6.205399
##      Positive 34.83425 5.665348
## 
##              DiabetesPedigreeFunction
## train$Outcome      [,1]      [,2]
##      Negative 0.4670185 0.2952471
##      Positive 0.6424521 0.3718617
## 
##              Age
## train$Outcome     [,1]      [,2]
##      Negative 28.46914  9.391534
##      Positive 37.71233 10.123514
## 
##              Outcome
## train$Outcome  Negative  Positive
##      Negative 0.8091603 0.1908397
##      Positive 0.2890173 0.7109827
nb_lap_pred <- predict(nb_lap_model, test)
cm_nb_lapl<- confusionMatrix(table(nb_lap_pred, test$Outcome))
cm_nb_lapl
## Confusion Matrix and Statistics
## 
##            
## nb_lap_pred Negative Positive
##    Negative       53       11
##    Positive       10       27
##                                           
##                Accuracy : 0.7921          
##                  95% CI : (0.6999, 0.8664)
##     No Information Rate : 0.6238          
##     P-Value [Acc > NIR] : 0.0002133       
##                                           
##                   Kappa : 0.5547          
##  Mcnemar's Test P-Value : 1.0000000       
##                                           
##             Sensitivity : 0.8413          
##             Specificity : 0.7105          
##          Pos Pred Value : 0.8281          
##          Neg Pred Value : 0.7297          
##              Prevalence : 0.6238          
##          Detection Rate : 0.5248          
##    Detection Prevalence : 0.6337          
##       Balanced Accuracy : 0.7759          
##                                           
##        'Positive' Class : Negative        
## 

Using the Linear Discriminant Analysis (LDA) Model –

set.seed(1234)
lda_model <- lda(data = train, Outcome~.)
lda_model
## Call:
## lda(Outcome ~ ., data = train)
## 
## Prior probabilities of groups:
##  Negative  Positive 
## 0.6893617 0.3106383 
## 
## Group means:
##          Pregnancies  Glucose BloodPressure SkinThickness  Insulin
## Negative    3.067901 110.3333      67.35802      26.57407 122.3025
## Positive    5.479452 144.3288      74.19178      32.75342 200.9041
##               BMI DiabetesPedigreeFunction      Age
## Negative 30.86605                0.4670185 28.46914
## Positive 34.83425                0.6424521 37.71233
## 
## Coefficients of linear discriminants:
##                                  LD1
## Pregnancies              0.072531222
## Glucose                  0.023410050
## BloodPressure            0.009540306
## SkinThickness            0.006584657
## Insulin                  0.000944487
## BMI                      0.038839475
## DiabetesPedigreeFunction 1.093580091
## Age                      0.024590227
plot(lda_model)

lda_pred <- predict(lda_model, test)
cm_lda_orig <- confusionMatrix(table(lda_pred$class, test$Outcome))
cm_lda_orig
## Confusion Matrix and Statistics
## 
##           
##            Negative Positive
##   Negative       57       19
##   Positive        6       19
##                                          
##                Accuracy : 0.7525         
##                  95% CI : (0.6567, 0.833)
##     No Information Rate : 0.6238         
##     P-Value [Acc > NIR] : 0.004243       
##                                          
##                   Kappa : 0.4342         
##  Mcnemar's Test P-Value : 0.016395       
##                                          
##             Sensitivity : 0.9048         
##             Specificity : 0.5000         
##          Pos Pred Value : 0.7500         
##          Neg Pred Value : 0.7600         
##              Prevalence : 0.6238         
##          Detection Rate : 0.5644         
##    Detection Prevalence : 0.7525         
##       Balanced Accuracy : 0.7024         
##                                          
##        'Positive' Class : Negative       
## 

Using the Random Forest Classification Model –

set.seed(1234)
rf_model <- randomForest(Outcome~., data = train, ntree = 500, proximity = TRUE, importance = TRUE)
rf_model
## 
## Call:
##  randomForest(formula = Outcome ~ ., data = train, ntree = 500,      proximity = TRUE, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 20%
## Confusion matrix:
##          Negative Positive class.error
## Negative      142       20   0.1234568
## Positive       27       46   0.3698630
varImpPlot(rf_model, cex=0.5)

plot(rf_model, log = "x", main="Random Forest (Error Rate vs. Number of Trees)")

rf_pred <- predict(rf_model, test)
cm_rf_orig <- confusionMatrix(table(rf_pred, test$Outcome))
cm_rf_orig
## Confusion Matrix and Statistics
## 
##           
## rf_pred    Negative Positive
##   Negative       56       18
##   Positive        7       20
##                                          
##                Accuracy : 0.7525         
##                  95% CI : (0.6567, 0.833)
##     No Information Rate : 0.6238         
##     P-Value [Acc > NIR] : 0.004243       
##                                          
##                   Kappa : 0.4405         
##  Mcnemar's Test P-Value : 0.045500       
##                                          
##             Sensitivity : 0.8889         
##             Specificity : 0.5263         
##          Pos Pred Value : 0.7568         
##          Neg Pred Value : 0.7407         
##              Prevalence : 0.6238         
##          Detection Rate : 0.5545         
##    Detection Prevalence : 0.7327         
##       Balanced Accuracy : 0.7076         
##                                          
##        'Positive' Class : Negative       
## 
set.seed(1234)
rf_new_model <- randomForest(Outcome~., data = train, ntree = 2000, proximity = TRUE, importance = TRUE)
rf_new_model
## 
## Call:
##  randomForest(formula = Outcome ~ ., data = train, ntree = 2000,      proximity = TRUE, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 2000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 20.43%
## Confusion matrix:
##          Negative Positive class.error
## Negative      143       19   0.1172840
## Positive       29       44   0.3972603
varImpPlot(rf_new_model, cex=0.5)

plot(rf_new_model, log = "x", main="Random Forest (Error Rate vs. Number of Trees)")

rf_new_pred <- predict(rf_new_model, test)
cm_rf_tune <- confusionMatrix(table(rf_new_pred, test$Outcome))
cm_rf_tune
## Confusion Matrix and Statistics
## 
##            
## rf_new_pred Negative Positive
##    Negative       56       17
##    Positive        7       21
##                                           
##                Accuracy : 0.7624          
##                  95% CI : (0.6674, 0.8414)
##     No Information Rate : 0.6238          
##     P-Value [Acc > NIR] : 0.002172        
##                                           
##                   Kappa : 0.4658          
##  Mcnemar's Test P-Value : 0.066193        
##                                           
##             Sensitivity : 0.8889          
##             Specificity : 0.5526          
##          Pos Pred Value : 0.7671          
##          Neg Pred Value : 0.7500          
##              Prevalence : 0.6238          
##          Detection Rate : 0.5545          
##    Detection Prevalence : 0.7228          
##       Balanced Accuracy : 0.7208          
##                                           
##        'Positive' Class : Negative        
## 

Using the Classification Tree (C-Tree) Model –

set.seed(1234)
ctree_model <- ctree(Outcome~., data = train, controls=ctree_control(maxdepth=5))
ctree_model
## 
##   Conditional inference tree with 5 terminal nodes
## 
## Response:  Outcome 
## Inputs:  Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age 
## Number of observations:  235 
## 
## 1) Glucose <= 127; criterion = 1, statistic = 61.625
##   2) Age <= 43; criterion = 0.999, statistic = 15.76
##     3) BMI <= 40.5; criterion = 0.991, statistic = 10.702
##       4) DiabetesPedigreeFunction <= 0.673; criterion = 0.983, statistic = 9.42
##         5)*  weights = 96 
##       4) DiabetesPedigreeFunction > 0.673
##         6)*  weights = 28 
##     3) BMI > 40.5
##       7)*  weights = 9 
##   2) Age > 43
##     8)*  weights = 12 
## 1) Glucose > 127
##   9)*  weights = 90
plot(ctree_model)

ctree_pred <- predict(ctree_model, test)
cm_ctree_orig <- confusionMatrix(table(ctree_pred, test$Outcome))
cm_ctree_orig
## Confusion Matrix and Statistics
## 
##           
## ctree_pred Negative Positive
##   Negative       49       13
##   Positive       14       25
##                                           
##                Accuracy : 0.7327          
##                  95% CI : (0.6354, 0.8159)
##     No Information Rate : 0.6238          
##     P-Value [Acc > NIR] : 0.01401         
##                                           
##                   Kappa : 0.4334          
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##             Sensitivity : 0.7778          
##             Specificity : 0.6579          
##          Pos Pred Value : 0.7903          
##          Neg Pred Value : 0.6410          
##              Prevalence : 0.6238          
##          Detection Rate : 0.4851          
##    Detection Prevalence : 0.6139          
##       Balanced Accuracy : 0.7178          
##                                           
##        'Positive' Class : Negative        
## 

Using the K-Means Clustering Technique (Best Result: K = 7) –

set.seed(1234)
df_z <- as.data.frame(lapply(df[,-9], scale))
km_model <- kmeans(df_z, 3)
km_model
## K-means clustering with 3 clusters of sizes 148, 110, 78
## 
## Cluster means:
##   Pregnancies    Glucose BloodPressure SkinThickness    Insulin        BMI
## 1  -0.4399046 -0.5044645    -0.4307171    -0.7562568 -0.4395881 -0.6866542
## 2  -0.4291739  0.2404463     0.2037941     0.7627544  0.2753585  0.7769629
## 3   1.4399360  0.6180980     0.5298561     0.3592695  0.4457642  0.2071655
##   DiabetesPedigreeFunction        Age
## 1              -0.12608667 -0.5341434
## 2               0.13348950 -0.2920598
## 3               0.05098695  1.4253819
## 
## Clustering vector:
##   [1] 1 1 3 3 3 2 2 2 3 3 1 3 2 1 1 3 2 3 1 1 1 3 3 3 1 1 1 1 2 2 1 2 1 3 1
##  [36] 2 1 3 1 1 2 1 1 1 1 2 3 1 3 1 1 2 2 2 2 1 2 1 1 2 1 2 2 2 3 2 1 1 1 3
##  [71] 3 1 1 2 2 1 3 3 2 3 2 3 2 1 2 2 1 3 3 1 3 3 2 1 3 1 1 2 3 1 1 3 1 1 2
## [106] 3 1 3 1 3 1 1 1 2 2 1 3 3 3 2 2 1 2 2 2 2 2 3 2 2 2 3 2 1 1 1 1 2 1 3
## [141] 1 2 2 1 1 1 3 1 1 3 1 1 1 2 3 2 2 2 1 1 2 2 2 2 3 2 1 1 1 1 1 3 1 1 1
## [176] 1 1 1 2 2 2 2 2 1 2 1 2 1 3 2 2 2 1 1 1 1 1 1 1 1 3 3 3 3 2 2 1 3 2 1
## [211] 2 1 1 1 3 3 1 1 1 1 1 1 3 3 1 2 1 1 1 2 1 2 3 2 2 1 3 3 1 2 1 1 1 3 2
## [246] 1 1 2 3 2 1 1 2 2 1 3 3 2 1 2 1 1 3 2 1 1 1 2 3 3 1 2 2 1 1 2 1 1 2 1
## [281] 3 1 1 2 1 2 1 2 2 3 3 2 3 3 3 3 2 1 1 2 1 2 2 3 3 1 1 2 1 1 2 1 1 3 2
## [316] 2 2 2 3 2 1 2 1 1 3 1 1 3 3 2 2 2 2 1 3 1
## 
## Within cluster sum of squares by cluster:
## [1] 566.2729 681.6656 528.4083
##  (between_SS / total_SS =  33.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
sil3 <- silhouette(km_model$cluster, dist(df_z))
summary(sil3)
## Silhouette of 336 units in 3 clusters from silhouette.default(x = km_model$cluster, dist = dist(df_z)) :
##  Cluster sizes and average silhouette widths:
##       148       110        78 
## 0.2952418 0.0968107 0.1579464 
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.11707  0.09749  0.19106  0.19841  0.29783  0.46749
plot(sil3, col=1:length(km_model$size), border=NA)

km_model$centers
##   Pregnancies    Glucose BloodPressure SkinThickness    Insulin        BMI
## 1  -0.4399046 -0.5044645    -0.4307171    -0.7562568 -0.4395881 -0.6866542
## 2  -0.4291739  0.2404463     0.2037941     0.7627544  0.2753585  0.7769629
## 3   1.4399360  0.6180980     0.5298561     0.3592695  0.4457642  0.2071655
##   DiabetesPedigreeFunction        Age
## 1              -0.12608667 -0.5341434
## 2               0.13348950 -0.2920598
## 3               0.05098695  1.4253819
par(mfrow=c(1, 1), mar=c(4, 4, 4, 2))
myColors <- c("darkblue", "red", "green", "brown", "pink", "purple", "yellow", "orange")
barplot(t(km_model$centers), beside = TRUE, xlab="cluster", ylab="value", col = myColors) 
legend("top", ncol=2, legend = c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"), fill = myColors)

df_km <- df
df_km$clusters <- km_model$cluster
ggplot(df_km, aes(Glucose, BloodPressure), main="Scatterplot: Glucose vs BloodPressure") +
  geom_point(aes(colour = factor(clusters), shape=factor(clusters), stroke = 8), alpha=1) + 
  theme_bw(base_size=25) +
  geom_text(aes(label=ifelse(clusters%in%1, as.character(clusters), ''), hjust=2, vjust=2, colour = factor(clusters)))+
  geom_text(aes(label=ifelse(clusters%in%2, as.character(clusters), ''), hjust=-2, vjust=2, colour = factor(clusters)))+
  geom_text(aes(label=ifelse(clusters%in%3, as.character(clusters), ''), hjust=2, vjust=-1, colour = factor(clusters))) + 
  guides(colour = guide_legend(override.aes = list(size=8))) +
theme(legend.position="top")

kpp_init = function(dat, K) {
  x = as.matrix(dat)
  n = nrow(x)
  # Randomly choose a first center
  centers = matrix(NA, nrow=K, ncol=ncol(x))
  set.seed(123)
  centers[1,] = as.matrix(x[sample(1:n, 1),])
  for (k in 2:K) {
    # Calculate dist^2 to closest center for each point
    dists = matrix(NA, nrow=n, ncol=k-1)
    for (j in 1:(k-1)) {
      temp = sweep(x, 2, centers[j,], '-')
      dists[,j] = rowSums(temp^2)
    }
    dists = rowMins(dists)
    # Draw next center with probability proportional to dist^2
    cumdists = cumsum(dists)
    prop = runif(1, min=0, max=cumdists[n])
    centers[k,] = as.matrix(x[min(which(cumdists > prop)),])
  }
  return(centers)
}

kmp_model <- kmeans(df_z, kpp_init(df_z, 3), iter.max=100, algorithm='Lloyd')
kmp_model
## K-means clustering with 3 clusters of sizes 113, 145, 78
## 
## Cluster means:
##   Pregnancies    Glucose BloodPressure SkinThickness    Insulin        BMI
## 1  -0.4193360  0.1838024     0.2100280     0.7312880  0.1782657  0.7516222
## 2  -0.4390313 -0.5118013    -0.4475874    -0.7719095 -0.4446160 -0.6982706
## 3   1.4236475  0.6851477     0.5277822     0.3755299  0.5682731  0.2091786
##   DiabetesPedigreeFunction        Age
## 1               0.06213071 -0.2737365
## 2              -0.11777023 -0.5428807
## 3               0.12892195  1.4057683
## 
## Clustering vector:
##   [1] 2 2 3 3 3 1 1 1 3 3 2 3 1 2 2 3 1 3 2 2 2 3 3 3 2 2 1 2 1 1 2 1 2 3 2
##  [36] 1 2 3 2 2 1 2 2 2 2 1 3 2 3 2 1 1 1 1 1 2 1 2 2 1 2 1 1 1 3 1 2 2 2 3
##  [71] 3 2 2 1 1 2 3 3 1 3 1 3 1 2 1 1 2 3 3 2 3 3 1 2 3 2 2 3 3 2 2 3 2 2 1
## [106] 3 2 3 2 3 2 2 2 1 1 2 3 3 3 3 1 2 1 1 1 1 1 3 1 1 1 3 1 2 2 2 2 1 2 3
## [141] 2 1 1 2 2 2 3 2 2 3 2 2 2 1 3 1 1 1 2 2 1 1 1 1 3 1 2 2 2 2 2 3 2 1 2
## [176] 2 2 2 1 1 1 1 1 2 1 2 1 2 3 1 1 1 2 2 2 2 2 2 2 2 3 3 3 1 1 1 2 3 1 2
## [211] 1 2 2 2 3 3 2 2 2 2 2 2 3 3 2 1 2 2 2 1 2 1 3 1 1 2 3 3 2 1 2 2 2 3 1
## [246] 2 2 1 3 1 2 2 1 1 2 3 3 1 2 1 2 2 3 1 2 2 2 1 3 3 2 1 1 2 2 1 2 2 1 2
## [281] 3 2 2 1 2 1 2 1 1 3 3 1 3 3 3 3 1 2 2 1 2 1 1 3 3 2 2 1 2 2 1 2 2 3 1
## [316] 1 1 1 1 1 2 1 2 2 3 2 2 3 3 1 1 1 1 2 3 2
## 
## Within cluster sum of squares by cluster:
## [1] 638.5022 554.0026 585.1043
##  (between_SS / total_SS =  33.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
kmp_model$centers
##   Pregnancies    Glucose BloodPressure SkinThickness    Insulin        BMI
## 1  -0.4193360  0.1838024     0.2100280     0.7312880  0.1782657  0.7516222
## 2  -0.4390313 -0.5118013    -0.4475874    -0.7719095 -0.4446160 -0.6982706
## 3   1.4236475  0.6851477     0.5277822     0.3755299  0.5682731  0.2091786
##   DiabetesPedigreeFunction        Age
## 1               0.06213071 -0.2737365
## 2              -0.11777023 -0.5428807
## 3               0.12892195  1.4057683
sil3 <- silhouette(kmp_model$cluster, dist(df_z))
summary(sil3)
## Silhouette of 336 units in 3 clusters from silhouette.default(x = kmp_model$cluster, dist = dist(df_z)) :
##  Cluster sizes and average silhouette widths:
##       113       145        78 
## 0.1127129 0.2835945 0.1335516 
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.08993  0.09269  0.18783  0.19129  0.28566  0.45646
plot(sil3, col=1:length(kmp_model$size), border=NA)

n_rows <- 21
mat <- matrix(0,nrow = n_rows)
for (i in 2:n_rows){
  set.seed(1234)
  kmp_model <- kmeans(df_z, kpp_init(df_z, i), iter.max=100, algorithm='Lloyd')
  sil <- silhouette(kmp_model$cluster, dist(df_z))
  mat[i] <- mean(as.matrix(sil)[,3])
}
colnames(mat) <- c("Avg_Silhouette_Value")
mat
##       Avg_Silhouette_Value
##  [1,]            0.0000000
##  [2,]            0.2323945
##  [3,]            0.1912940
##  [4,]            0.1479250
##  [5,]            0.1344793
##  [6,]            0.1230041
##  [7,]            0.1398260
##  [8,]            0.1371855
##  [9,]            0.1380133
## [10,]            0.1380535
## [11,]            0.1231905
## [12,]            0.1307526
## [13,]            0.1276510
## [14,]            0.1270141
## [15,]            0.1262650
## [16,]            0.1226770
## [17,]            0.1185223
## [18,]            0.1159048
## [19,]            0.1172251
## [20,]            0.1165688
## [21,]            0.1156074
ggplot(data.frame(k=2:n_rows,sil=mat[2:n_rows]),aes(x=k,y=sil)) + geom_line() + scale_x_continuous(breaks = 2:n_rows)

k <- 2
set.seed(1234)
kmp2_model <- kmeans(df_z, kpp_init(df_z, k), iter.max=200, algorithm="MacQueen")
sil2 <- silhouette(kmp2_model$cluster, dist(df_z))
summary(sil2)
## Silhouette of 336 units in 2 clusters from silhouette.default(x = kmp2_model$cluster, dist = dist(df_z)) :
##  Cluster sizes and average silhouette widths:
##        142        194 
## 0.09200984 0.33515034 
## Individual silhouette widths:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.1282  0.1250  0.2309  0.2324  0.3793  0.5027
plot(sil2, col=1:length(kmp2_model$size), border=NA)

k <- 4
set.seed(1234)
kmp4_model <- kmeans(df_z, kpp_init(df_z, k), iter.max=200, algorithm="MacQueen")
sil4 <- silhouette(kmp4_model$cluster, dist(df_z))
summary(sil4)
## Silhouette of 336 units in 4 clusters from silhouette.default(x = kmp4_model$cluster, dist = dist(df_z)) :
##  Cluster sizes and average silhouette widths:
##        103        111         50         72 
## 0.14745426 0.20923738 0.02783948 0.13249982 
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.15367  0.06759  0.15000  0.14686  0.22068  0.39826
plot(sil4, col=1:length(kmp4_model$size), border=NA)

k <- 7
set.seed(1234)
kmp7_model <- kmeans(df_z, kpp_init(df_z, k), iter.max=200, algorithm="MacQueen")
sil7 <- silhouette(kmp7_model$cluster, dist(df_z))
summary(sil7)
## Silhouette of 336 units in 7 clusters from silhouette.default(x = kmp7_model$cluster, dist = dist(df_z)) :
##  Cluster sizes and average silhouette widths:
##         45         41         35         59         20         50 
## 0.15087837 0.13186039 0.04067210 0.11255464 0.07534761 0.16593818 
##         86 
## 0.18901895 
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.12311  0.05225  0.12866  0.13786  0.22776  0.41285
plot(sil7, col=1:length(kmp7_model$size), border=NA)

k <- 8
set.seed(1234)
kmp8_model <- kmeans(df_z, kpp_init(df_z, k), iter.max=200, algorithm="MacQueen")
sil8 <- silhouette(kmp8_model$cluster, dist(df_z))
summary(sil8)
## Silhouette of 336 units in 8 clusters from silhouette.default(x = kmp8_model$cluster, dist = dist(df_z)) :
##  Cluster sizes and average silhouette widths:
##         55         42         25         60         19         45 
## 0.12746487 0.12542888 0.08668730 0.10523802 0.05696327 0.18514059 
##         34         56 
## 0.12806721 0.23459332 
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.13982  0.04988  0.13604  0.14186  0.22597  0.43003
plot(sil8, col=1:length(kmp8_model$size), border=NA)

k <- 11
set.seed(1234)
kmp11_model <- kmeans(df_z, kpp_init(df_z, k), iter.max=200, algorithm="MacQueen")
sil11 <- silhouette(kmp11_model$cluster, dist(df_z))
summary(sil11)
## Silhouette of 336 units in 11 clusters from silhouette.default(x = kmp11_model$cluster, dist = dist(df_z)) :
##  Cluster sizes and average silhouette widths:
##          42          60          26          25          11          29 
##  0.17941366  0.23337868  0.05755744  0.09833683 -0.02482009  0.12000844 
##          23          15          26          38          41 
##  0.12526127  0.11506333  0.11742627  0.11311238  0.13706808 
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.19813  0.06575  0.13136  0.13773  0.22522  0.44301
plot(sil11, col=1:length(kmp11_model$size), border=NA)

cat("\nFrom the above results, the best value for the K parameter would be 7.")
## 
## From the above results, the best value for the K parameter would be 7.

Using the Generalized Linear Model for Performing Logistic Regression –

glm_model <- glm(Outcome~., family = binomial(link = 'logit'), data = train)
summary(glm_model)
## 
## Call:
## glm(formula = Outcome ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7014  -0.5429  -0.2738   0.5669   2.9094  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -12.721212   1.945110  -6.540 6.15e-11 ***
## Pregnancies                0.119194   0.075677   1.575 0.115249    
## Glucose                    0.037403   0.007707   4.853 1.21e-06 ***
## BloodPressure              0.015244   0.017823   0.855 0.392405    
## SkinThickness              0.006066   0.023429   0.259 0.795700    
## Insulin                    0.001375   0.001910   0.720 0.471642    
## BMI                        0.090099   0.042611   2.114 0.034478 *  
## DiabetesPedigreeFunction   2.017052   0.595117   3.389 0.000701 ***
## Age                        0.032996   0.024002   1.375 0.169215    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 291.22  on 234  degrees of freedom
## Residual deviance: 181.56  on 226  degrees of freedom
## AIC: 199.56
## 
## Number of Fisher Scoring iterations: 5
glm_pred <- predict(glm_model, test, type = "response")
table(glm_pred>0.5, test$Outcome)
##        
##         Negative Positive
##   FALSE       57       19
##   TRUE         6       19
glm_acc <- (table(glm_pred>0.5, test$Outcome)[1] + table(glm_pred>0.5, test$Outcome)[4]) / (table(glm_pred>0.5, test$Outcome)[1] + table(glm_pred>0.5, test$Outcome)[2] + table(glm_pred>0.5, test$Outcome)[3] + table(glm_pred>0.5, test$Outcome)[4])
glm_speci <- table(glm_pred>0.5, test$Outcome)[4] / (table(glm_pred>0.5, test$Outcome)[2] + table(glm_pred>0.5, test$Outcome)[4])
glm_sensi <- table(glm_pred>0.5, test$Outcome)[1] / (table(glm_pred>0.5, test$Outcome)[1] + table(glm_pred>0.5, test$Outcome)[3])
cat("\nAccuracy:",glm_acc)
## 
## Accuracy: 0.7524752
cat("\nSpecificity:",glm_speci)
## 
## Specificity: 0.76
cat("\nSensitivity:",glm_sensi)
## 
## Sensitivity: 0.75

Using the Gradient Boosting Method for Modeling –

gbm_model <- gbm(Outcome~., data = train, distribution = "gaussian", n.trees = 10000, shrinkage = 0.01, interaction.depth = 4, bag.fraction = 0.5, train.fraction = 0.5, n.minobsinnode = 10, cv.folds = 3, keep.data = TRUE, verbose = FALSE, n.cores = 1)
best_iteration <- gbm.perf(gbm_model, method = "cv", plot.it = FALSE)
fit_control <- trainControl(method = "cv", number = 5, returnResamp = "all")
gbm_final_model <- train(Outcome~., data = train, method = "gbm", distribution = "bernoulli", trControl = fit_control, verbose = F, tuneGrid = data.frame(.n.trees = best_iteration, .shrinkage = 0.01, .interaction.depth = 1, .n.minobsinnode = 1))
gbm_pred <- predict(gbm_final_model, test)
cm_gbm_orig <- confusionMatrix(table(gbm_pred, test$Outcome))
cm_gbm_orig
## Confusion Matrix and Statistics
## 
##           
## gbm_pred   Negative Positive
##   Negative       55       18
##   Positive        8       20
##                                          
##                Accuracy : 0.7426         
##                  95% CI : (0.646, 0.8244)
##     No Information Rate : 0.6238         
##     P-Value [Acc > NIR] : 0.007895       
##                                          
##                   Kappa : 0.4213         
##  Mcnemar's Test P-Value : 0.077556       
##                                          
##             Sensitivity : 0.8730         
##             Specificity : 0.5263         
##          Pos Pred Value : 0.7534         
##          Neg Pred Value : 0.7143         
##              Prevalence : 0.6238         
##          Detection Rate : 0.5446         
##    Detection Prevalence : 0.7228         
##       Balanced Accuracy : 0.6997         
##                                          
##        'Positive' Class : Negative       
## 

Using a Neural Network Model (Error Decreases with Increase in # Hidden Nodes and Layers) –

normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}

nn_train <- as.data.frame(lapply(train[,-9], normalize))
nn_train$Outcome <- ifelse(train$Outcome == "Positive", 1, 0)
nn_test <- as.data.frame(lapply(test[,-9], normalize))
nn_test$Outcome <- ifelse(test$Outcome == "Positive", 1, 0)

nn_model <- neuralnet(Outcome~Pregnancies+Glucose+BloodPressure+SkinThickness+Insulin+BMI+DiabetesPedigreeFunction+Age, data = nn_train, hidden = 1, stepmax = 1e6)
plot(nn_model, rep = "best")

nn_pred <- compute(nn_model, nn_test[,-9])
pred_results <- nn_pred$net.result
cor(pred_results, nn_test$Outcome)
##              [,1]
## [1,] 0.4844594571
nnp_model <- neuralnet(Outcome~Pregnancies+Glucose+BloodPressure+SkinThickness+Insulin+BMI+DiabetesPedigreeFunction+Age, data = nn_train, hidden = 10, stepmax = 1e6)
plot(nnp_model, rep = "best")

nnp_pred <- compute(nnp_model, nn_test[,-9])
predp_results <- nnp_pred$net.result
cor(predp_results, nn_test$Outcome)
##             [,1]
## [1,] 0.392176431
nnph_model <- neuralnet(Outcome~Pregnancies+Glucose+BloodPressure+SkinThickness+Insulin+BMI+DiabetesPedigreeFunction+Age, data = nn_train, hidden = c(10,10,10), stepmax = 1e6)
plot(nnph_model, rep = "best")

nnph_pred <- compute(nnph_model, nn_test[,-9])
predph_results <- nnph_pred$net.result
cor(predph_results, nn_test$Outcome)
##              [,1]
## [1,] 0.2545543719

Using a Support Vector Machine Model (Radial, Linear & Laplacian) –

set.seed(1234)
svm_rbf_model <- ksvm(Outcome~., data = train, kernel = "rbfdot")
svm_rbf_model
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.110362872848112 
## 
## Number of Support Vectors : 130 
## 
## Objective Function Value : -90.3376 
## Training error : 0.13617
svm_rbf_pred <- predict(svm_rbf_model, test)
cm_svm_rbf <- confusionMatrix(table(svm_rbf_pred, test$Outcome))
cm_svm_rbf
## Confusion Matrix and Statistics
## 
##             
## svm_rbf_pred Negative Positive
##     Negative       53       19
##     Positive       10       19
##                                                
##                Accuracy : 0.7128713            
##                  95% CI : (0.6143106, 0.798545)
##     No Information Rate : 0.6237624            
##     P-Value [Acc > NIR] : 0.03857959           
##                                                
##                   Kappa : 0.3580977            
##  Mcnemar's Test P-Value : 0.13739483           
##                                                
##             Sensitivity : 0.8412698            
##             Specificity : 0.5000000            
##          Pos Pred Value : 0.7361111            
##          Neg Pred Value : 0.6551724            
##              Prevalence : 0.6237624            
##          Detection Rate : 0.5247525            
##    Detection Prevalence : 0.7128713            
##       Balanced Accuracy : 0.6706349            
##                                                
##        'Positive' Class : Negative             
## 
set.seed(1234)
svm_linear_model <- ksvm(Outcome~., data = train, kernel = "vanilladot")
##  Setting default kernel parameters
svm_linear_model
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 103 
## 
## Objective Function Value : -98.9642 
## Training error : 0.187234
svm_linear_pred <- predict(svm_linear_model, test)
cm_svm_linear <- confusionMatrix(table(svm_linear_pred, test$Outcome))
cm_svm_linear
## Confusion Matrix and Statistics
## 
##                
## svm_linear_pred Negative Positive
##        Negative       54       19
##        Positive        9       19
##                                                 
##                Accuracy : 0.7227723             
##                  95% CI : (0.6248177, 0.8072313)
##     No Information Rate : 0.6237624             
##     P-Value [Acc > NIR] : 0.02376890            
##                                                 
##                   Kappa : 0.376818              
##  Mcnemar's Test P-Value : 0.08897301            
##                                                 
##             Sensitivity : 0.8571429             
##             Specificity : 0.5000000             
##          Pos Pred Value : 0.7397260             
##          Neg Pred Value : 0.6785714             
##              Prevalence : 0.6237624             
##          Detection Rate : 0.5346535             
##    Detection Prevalence : 0.7227723             
##       Balanced Accuracy : 0.6785714             
##                                                 
##        'Positive' Class : Negative              
## 
set.seed(1234)
svm_laplace_model <- ksvm(Outcome~., data = train, kernel = "laplacedot")
svm_laplace_model
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Laplace kernel function. 
##  Hyperparameter : sigma =  0.110362872848112 
## 
## Number of Support Vectors : 135 
## 
## Objective Function Value : -103.7358 
## Training error : 0.140426
svm_laplace_pred <- predict(svm_laplace_model, test)
cm_svm_lapl <- confusionMatrix(table(svm_laplace_pred, test$Outcome))
cm_svm_lapl
## Confusion Matrix and Statistics
## 
##                 
## svm_laplace_pred Negative Positive
##         Negative       54       18
##         Positive        9       20
##                                                 
##                Accuracy : 0.7326733             
##                  95% CI : (0.6353758, 0.8158651)
##     No Information Rate : 0.6237624             
##     P-Value [Acc > NIR] : 0.01401434            
##                                                 
##                   Kappa : 0.4023669             
##  Mcnemar's Test P-Value : 0.12365771            
##                                                 
##             Sensitivity : 0.8571429             
##             Specificity : 0.5263158             
##          Pos Pred Value : 0.7500000             
##          Neg Pred Value : 0.6896552             
##              Prevalence : 0.6237624             
##          Detection Rate : 0.5346535             
##    Detection Prevalence : 0.7128713             
##       Balanced Accuracy : 0.6917293             
##                                                 
##        'Positive' Class : Negative              
## 

Comparing the Accuracy, Sensitivity and Specificity of the Models –

tab_results <- data.frame(
  Predictive_Model = c("Original C5.0", "Tuned 5.0", "Original R-PART", "Tuned R-PART", "One R Model", "JRip Model", "Original Naive Bayes", "Laplacian Naive Bayes", "Classification Tree", "LDA Model", "Original Random Forest", "Tuned Random Forest", "Logistic Regression", "Gradient Boosting Model", "Gaussian SVM", "Laplacian SVM"), 
  Accuracy = c(round(cm_c5_orig$overall[1],6), round(cm_c5_boost$overall[1],6), round(cm_rp_orig$overall[1],6), round(cm_rp_tune$overall[1],6), round(cm_oneR_orig$overall[1],6), round(cm_jrip_orig$overall[1],6), round(cm_nb_orig$overall[1],6), round(cm_nb_lapl$overall[1],6), round(cm_ctree_orig$overall[1],6), round(cm_lda_orig$overall[1],6), round(cm_rf_orig$overall[1],6), round(cm_rf_tune$overall[1],6), round(glm_acc,6), round(cm_gbm_orig$overall[1],6), round(cm_svm_rbf$overall[1],6), round(cm_svm_lapl$overall[1],6)), 
  Sensitivity = c(round(cm_c5_orig$table[1]/(cm_c5_orig$table[1]+cm_c5_orig$table[3]), 6), 
                  round(cm_c5_boost$table[1]/(cm_c5_boost$table[1]+cm_c5_boost$table[3]), 6),
                  round(cm_rp_orig$table[1]/(cm_rp_orig$table[1]+cm_rp_orig$table[3]), 6),
                  round(cm_rp_tune$table[1]/(cm_rp_tune$table[1]+cm_rp_tune$table[3]), 6),
                  round(cm_oneR_orig$table[1]/(cm_oneR_orig$table[1]+cm_oneR_orig$table[3]), 6),
                  round(cm_jrip_orig$table[1]/(cm_jrip_orig$table[1]+cm_jrip_orig$table[3]), 6),
                  round(cm_nb_orig$table[1]/(cm_nb_orig$table[1]+cm_nb_orig$table[3]), 6),
                  round(cm_nb_lapl$table[1]/(cm_nb_lapl$table[1]+cm_nb_lapl$table[3]), 6),
                  round(cm_ctree_orig$table[1]/(cm_ctree_orig$table[1]+cm_ctree_orig$table[3]), 6),
                  round(cm_lda_orig$table[1]/(cm_lda_orig$table[1]+cm_lda_orig$table[3]), 6),
                  round(cm_rf_orig$table[1]/(cm_rf_orig$table[1]+cm_rf_orig$table[3]), 6),
                  round(cm_rf_tune$table[1]/(cm_rf_tune$table[1]+cm_rf_tune$table[3]), 6),
                  round(glm_sensi, 6),
                  round(cm_gbm_orig$table[1]/(cm_gbm_orig$table[1]+cm_gbm_orig$table[3]), 6),
                  round(cm_svm_rbf$table[1]/(cm_svm_rbf$table[1]+cm_svm_rbf$table[3]), 6),
                  round(cm_svm_lapl$table[1]/(cm_svm_lapl$table[1]+cm_svm_lapl$table[3]), 6)
                  ), 
  Specificity = c(round(cm_c5_orig$table[4]/(cm_c5_orig$table[2]+cm_c5_orig$table[4]), 6), 
                  round(cm_c5_boost$table[4]/(cm_c5_boost$table[2]+cm_c5_boost$table[4]), 6),
                  round(cm_rp_orig$table[4]/(cm_rp_orig$table[2]+cm_rp_orig$table[4]), 6),
                  round(cm_rp_tune$table[4]/(cm_rp_tune$table[2]+cm_rp_tune$table[4]), 6),
                  round(cm_oneR_orig$table[4]/(cm_oneR_orig$table[2]+cm_oneR_orig$table[4]), 6),
                  round(cm_jrip_orig$table[4]/(cm_jrip_orig$table[2]+cm_jrip_orig$table[4]), 6),
                  round(cm_nb_orig$table[4]/(cm_nb_orig$table[2]+cm_nb_orig$table[4]), 6),
                  round(cm_nb_lapl$table[4]/(cm_nb_lapl$table[2]+cm_nb_lapl$table[4]), 6),
                  round(cm_ctree_orig$table[4]/(cm_ctree_orig$table[2]+cm_ctree_orig$table[4]), 6),
                  round(cm_lda_orig$table[4]/(cm_lda_orig$table[2]+cm_lda_orig$table[4]), 6),
                  round(cm_rf_orig$table[4]/(cm_rf_orig$table[2]+cm_rf_orig$table[4]), 6),
                  round(cm_rf_tune$table[4]/(cm_rf_tune$table[2]+cm_rf_tune$table[4]), 6),
                  round(glm_speci, 6),
                  round(cm_gbm_orig$table[4]/(cm_gbm_orig$table[2]+cm_gbm_orig$table[4]), 6),
                  round(cm_svm_rbf$table[4]/(cm_svm_rbf$table[2]+cm_svm_rbf$table[4]), 6),
                  round(cm_svm_lapl$table[4]/(cm_svm_lapl$table[2]+cm_svm_lapl$table[4]), 6)
                  )
)
  
kable(tab_results, "html") %>%
  kable_styling(bootstrap_options = "striped", font_size = 12) %>%
  row_spec(c(2,8,10,11,12,13), bold = TRUE, color = "white", background = "green") %>%
  row_spec(7, bold = TRUE, color = "white", background = "blue")
Predictive_Model Accuracy Sensitivity Specificity
Original C5.0 0.712871 0.723684 0.680000
Tuned 5.0 0.772277 0.812500 0.702703
Original R-PART 0.722772 0.733333 0.692308
Tuned R-PART 0.742574 0.768116 0.687500
One R Model 0.712871 0.736111 0.655172
JRip Model 0.702970 0.761905 0.605263
Original Naive Bayes 0.970297 0.983871 0.948718
Laplacian Naive Bayes 0.792079 0.828125 0.729730
Classification Tree 0.732673 0.790323 0.641026
LDA Model 0.752475 0.750000 0.760000
Original Random Forest 0.752475 0.756757 0.740741
Tuned Random Forest 0.762376 0.767123 0.750000
Logistic Regression 0.752475 0.750000 0.760000
Gradient Boosting Model 0.742574 0.753425 0.714286
Gaussian SVM 0.712871 0.736111 0.655172
Laplacian SVM 0.732673 0.750000 0.689655
col <- c("yellow", "darkblue")
par(mfrow=c(2,2))

fourfoldplot(cm_c5_orig$table, color = col, conf.level = 0, margin = 1, main=paste("Original C5.0 (",round(cm_c5_orig$overall[1]*100),"%)",sep=""))
fourfoldplot(cm_c5_boost$table, color = col, conf.level = 0, margin = 1, main=paste("Tuned C5.0 (",round(cm_c5_boost$overall[1]*100),"%)",sep=""))

fourfoldplot(cm_rp_orig$table, color = col, conf.level = 0, margin = 1, main=paste("Original R-PART (",round(cm_rp_orig$overall[1]*100),"%)",sep=""))
fourfoldplot(cm_rp_tune$table, color = col, conf.level = 0, margin = 1, main=paste("Tuned R-PART (",round(cm_rp_tune$overall[1]*100),"%)",sep=""))

fourfoldplot(cm_oneR_orig$table, color = col, conf.level = 0, margin = 1, main=paste("One R Model (",round(cm_oneR_orig$overall[1]*100),"%)",sep=""))
fourfoldplot(cm_jrip_orig$table, color = col, conf.level = 0, margin = 1, main=paste("JRip Model (",round(cm_jrip_orig$overall[1]*100),"%)",sep=""))

fourfoldplot(cm_nb_orig$table, color = col, conf.level = 0, margin = 1, main=paste("Original Naive Bayes (",round(cm_nb_orig$overall[1]*100),"%)",sep=""))
fourfoldplot(cm_nb_lapl$table, color = col, conf.level = 0, margin = 1, main=paste("Laplacian Naive Bayes (",round(cm_nb_lapl$overall[1]*100),"%)",sep=""))

fourfoldplot(cm_ctree_orig$table, color = col, conf.level = 0, margin = 1, main=paste("Classification Tree (",round(cm_ctree_orig$overall[1]*100),"%)",sep=""))
fourfoldplot(cm_lda_orig$table, color = col, conf.level = 0, margin = 1, main=paste("LDA Model (",round(cm_lda_orig$overall[1]*100),"%)",sep=""))

fourfoldplot(cm_rf_orig$table, color = col, conf.level = 0, margin = 1, main=paste("Original Random Forest (",round(cm_rf_orig$overall[1]*100),"%)",sep=""))
fourfoldplot(cm_rf_tune$table, color = col, conf.level = 0, margin = 1, main=paste("Tuned Random Forest (",round(cm_rf_tune$overall[1]*100),"%)",sep=""))

fourfoldplot(table(glm_pred>0.5, test$Outcome), color = col, conf.level = 0, margin = 1, main=paste("Logistic Regression (",round(glm_acc*100),"%)",sep=""))
fourfoldplot(cm_gbm_orig$table, color = col, conf.level = 0, margin = 1, main=paste("Gradient Boosting Model (",round(cm_gbm_orig$overall[1]*100),"%)",sep=""))

fourfoldplot(cm_svm_rbf$table, color = col, conf.level = 0, margin = 1, main=paste("Gaussian SVM (",round(cm_svm_rbf$overall[1]*100),"%)",sep=""))
fourfoldplot(cm_svm_lapl$table, color = col, conf.level = 0, margin = 1, main=paste("Laplacian SVM (",round(cm_svm_lapl$overall[1]*100),"%)",sep=""))

DISCUSSION OF RESULTS

## IMPORTANT NOTE: The BEST PERFORMING model has been highlighted in BLUE COLOR amongst the TOP PERFORMING models that have been highlighted in GREEN COLOR.
## 
## From the above results, we can conclude that the Naive Bayes Model (without Laplacian smoothing) performs the best with a staggering 97% accuracy as compared to the other models. Surprisingly, Laplacian smoothing causes the accuracy of this model to decrease to 79% with a parameter value of 50. The second-best model from the above list is the Tuned C5.0 model, which has an accuracy of 77%. The third-best model is the Tuned Random Forest Model, having an accuracy of 76%. The models with the highest accuracy are considered to be the best [13].
## 
## In regard to the sensitivity of the models, the Naive Bayes Model (without Laplacian smoothing) has the highest sensitivity of 96% followed by the Linear Discriminant Analysis (LDA) Model having a sensitivity of 90%. The Random Forest Model ranks third with a sensitivity of 88%. The higher the sensitivity of a model, the larger is its true positive rate and the better is its recall. The Type-I Error of such a model will likely be low [13].
## 
## With respect to the specificity of the models, the Naive Bayes Model (without Laplacian smoothing) has the highest specificity of 97% followed by the Logistic Regression Model having a specificity of 76%. The Naive Bayes Model (with a laplacian smoothing parameter = 50) ranks third with a specificity of 71%. The higher the specificity of a model, the larger is its true negative rate. The Type-II Error of such a model will likely be low [13].

CONCLUSION

## From the above discussion, we can conclude that the Naive Bayes Model without Laplacian smoothing is best-suited for this bioinformatics application. This is because this model has the highest accuracy, sensitivity and specificity amongst all of the above models. Indirectly, this model would have the lowest Type-I and Type-II errors too. Since we know the factors that can contribute to diabetes from the exploratory analysis, we can successfully build a predictive model to detect the onset of diabetes mellitus in patients based on the patients' physical attributes and above results.

ACKNOWLEDGEMENTS

## I would like to sincerely thank Prof. Ivo D. Dinov for all his encouragement and support in enabling me excel in this course. The homework assignments and in-class activities allowed me to understand much of the material, which ultimately helped me implement this whole project on my own. I would highly recommend taking this class on Data Science and Predictive Analytics (DSPA) if you are a student or professional interested in advancing your knowledge about using R to perform exploratory analyses and implement machine learning algorithms on your own. Thank you very much for such good course material.

REFERENCES

## 1.  'About diabetes'. World Health Organization. Archived from the original on 31 March 2014. Retrieved 4 April 2014.
## 
## 2.  'Diabetes Fact sheet N°312'. WHO. October 2013. Archived from the original on 26 August 2013. Retrieved 25 March 2014.
## 
## 3.  'Update 2015'. ID F. International Diabetes Federation. p. 13. Archived from the original on 22 March 2016. Retrieved 21 March 2016.
## 
## 4.  Williams textbook of endocrinology (12th ed.). Elsevier/Saunders. pp. 1371–1435. ISBN 978-1-4377-0324-5.
## 
## 5.  Shi Y, Hu FB (June 2014). 'The global implications of diabetes and cancer'. Lancet. 383 (9933): 1947–8. doi:10.1016/S0140-6736(14)60886-2. PMID 24910221.
## 
## 6.  Vos T, Flaxman AD, Naghavi M, Lozano R, Michaud C, Ezzati M, et al. (December 2012). 'Years lived with disability (YLDs) for 1160 sequelae of 289 diseases and injuries 1990-2010: a systematic analysis for the Global Burden of Disease Study 2010'. Lancet. 380 (9859): 2163–96. doi:10.1016/S0140-6736(12)61729-2. PMID 23245607.
## 
## 7.  IDF DIABETES ATLAS (6th ed.). International Diabetes Federation. 2013. p. 7. ISBN 2930229853. Archived from the original (PDF) on 9 June 2014.
## 
## 8.  'Economic costs of diabetes in the U.S. in 2012'. Diabetes Care. 36 (4): 1033–46. April 2013. doi:10.2337/dc12-2625. PMC 3609540. Freely accessible. PMID 23468086.
## 
## 9.  Ron Kohavi; Foster Provost (1998). 'Glossary of terms'. Machine Learning. 30: 271–274.
## 
## 10. Pima Indians Diabetes - dataset by uci. (2017, August 16). Retrieved from https://data.world/uci/pima-indians-diabetes
## 
## 11. Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
## 
## 12. Dinov, I. (n.d.). Learning Modules. Retrieved from http://www.socr.umich.edu/people/dinov/courses/DSPA_Topics.html
## 
## 13. Yang, C., Zou, Y., Liu, J., & Mulligan, K. (2015). Predictive model evaluation for PHM. International Journal of Prognostics and Health Management, 5(2), 1-11. Retrieved April 18, 2018, from https://nparc.nrc-cnrc.gc.ca/eng/view/object/?id=dce076fe-03db-4d8c-b097-5ca015aa414d.